Computers & systematic biology workshop

Michael Walker walker at GENBANK.BIO.NET
Sat Feb 24 17:49:39 EST 1990

       Artificial Intelligence and Modern Computer Methods
            in Systematic Biology (ARTISYST Workshop)
The Systematic Biology Program of the National Science Foundation is
sponsoring a Workshop on Artificial Intelligence, Expert Systems, and
Modern Computer Methods in Systematic Biology, to be held September 9
to 14, 1990, at the University of California, Davis.  There will be
about 45 participants representing an even mixture of biologists and
computer scientists.
Expenses for participants will be paid, including hotel (paid directly
by the workshop organizers), food (per diem of US $35), and travel
(with a maximum of US $500 for travel expenses).  Attendance at the
workshop is by invitation only.
These are the subject areas for the workshop:
1.  Scientific workstations for systematics;
2.  Expert systems, expert workstations and other tools for identification;
3.  Phylogenetic inference and mapping characters onto tree topologies;
4.  Literature data extraction and geographical data;
5.  Machine vision and feature extraction applied to systematics.
The workshop will examine state-of-the-art computing methods and
particularly Artificial Intelligence methods and the possibilities
they offer for applications in systematics.  Methods for knowledge
representation as they apply to systematics will be a central focus of
the workshop.  This meeting will provide systematists the opportunity
to make productive contacts with computer scientists interested in
these applications.  It will consist of tutorials, lectures on
problems and approaches in each area, working groups and discussion
periods, and demonstrations of relevant software.
Participants will present their previous or proposed research in a
lecture, in a poster session, or in a software demonstration session.
In addition, some participants will present tutorials in their area of
expertise.
Preference will be given to applicants who are most likely to continue
active research and teaching in this area.  The Workshop organizers
welcome applications from all qualified biologists and computer
scientists, and strongly encourage women, minorities, and persons with
disabilities to apply.
If you are interested in participating, please apply by sending to the
workshop organizers the information suggested below:
1) your name, address, telephone number, and, if you have one, your
electronic mail address;
2) whether you apply as a computer scientist or as a biologist;
3) a short resume; 
4) a description of your previous work related to the workshop topic; 
5) a description of your planned research and how it relates to the workshop;
6) whether you, as a biologist (or as a computer scientist), have
taken or would like to take steps to establish permanent collaboration
with computer scientists (or biologists).  A total of two pages or
less is preferred.  This material will be the primary basis for
selecting workshop participants.
If you have software that you would like to demonstrate at the
workshop, please give a brief description, and indicate the hardware
that you need to run the program.  Several PCs and workstations will
be available at the workshop.
Mail your completed application to:
Renaud Fortuner, ARTISYST Workshop Chairman, 
California Department of Food and Agriculture
Analysis & Identification, room 340
P.O. Box 942871
Sacramento, CA 94271-0001
(916) 445-4521
E-mail: rfortuner at
Notification of acceptance of proposal will be made before May 31, 1990.
For further information, contact Renaud Fortuner, Michael Walker,
Program Chairman, (Walker at, or a member of the
steering committee:
Jim Diederich, U.C. Davis (dieder at
Jack Milton, U.C. Davis (milton at
Peter Cheeseman, NASA AMES (cheeseman at
Eric Horvitz, Stanford University (horvitz at
Julian Humphries, Cornell University (lqyy at crnlvax5.bitnet)
George Lauder, U.C. Irvine (glauder at UCIvmsa.bitnet)
F. James Rohlf, SUNY (rohlf at sbbiovm.bitnet)
James Woolley, Texas A&M University (woolley at tamento.bitnet)
The five subject areas selected for the workshop are described in more
detail below.
                         James Diederich
                           Jack Milton
                    Department of Mathematics
                    University of California
                         Davis, CA 95616
     Recent advances in computing technology are bringing greatly
increased computing power to the desk top of the practicing
systematist for prices that were unheard of only a few years ago.  For
example, in mid-1989 one could expect to purchase a 10 to 15 MIPS
(million instructions per second) workstation for about $15,000.  This
machine might have eight megabytes of memory and possess a several
hundred megabyte hard disk of its own or be networked to a large file
server.  Currently workstation manufacturers seem committed to
doubling the performance of their workstations and halving the price
each year.  The continuing dramatic increase in computing power is
making compute-intensive software, much of which was until recently
only in the domain of mainframe computer users, available and working
in a responsive manner on the desk top.  We believe that in the near
future this newly available computing power will bring such
capabilities to the systematist as networked heterogeneous databases,
sophisticated three dimensional modeling capabilities and
corresponding databases, and semi-automated assistance in tasks
requiring specialized expertise.  It seems clear that these advances
will change the way the systematist works.
     One approach in the area of semi-automated problem solving using
modern computing methods involves capturing domain expertise in the
form of rules, definitions, classifications, and the like.  In
conjunction with this a mechanism for carrying out some form of
reasoning (inference) is provided.  In some cases the goal is to mimic
the behavior of an expert, while in other cases the goal is more
modest in that the system does not try to behave exactly as an expert
would.  Systems implemented under this philosophy are called expert
systems and are typically used only by experts.
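The combination of stored rules and an inference mechanism described
above can be sketched in a few lines.  This is only an illustration of
forward chaining, one common inference strategy; the rule contents and
the names RULES and forward_chain are invented for the example and do
not come from any system discussed at the workshop.

```python
# Hypothetical rules: each pairs a set of required facts with one conclusion.
RULES = [
    ({"has_feathers"}, "is_bird"),
    ({"is_bird", "long_arm_bones", "light_skeleton"}, "can_fly"),
]


def forward_chain(facts, rules):
    """Apply the rules repeatedly until no rule adds a new fact."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires when all of its conditions are already known.
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


derived = forward_chain(
    {"has_feathers", "long_arm_bones", "light_skeleton"}, RULES
)
print(sorted(derived))
```

The second rule can fire only after the first has added "is_bird",
which is why the loop repeats until nothing changes.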
     A different approach, which we take in our work, is to provide a
set of tools to assist the scientist (possibly a non-expert in the
field and in computer expertise) in carrying out his/her activities.
We call such a collection of tools an expert workstation.  Some tools
may, and often will, be based on knowledge of the domain, and
certainly some tools could be expert systems themselves.  However, the
expert workstation approach does not try to mimic expertise and
usually will not be considered as a replacement for expertise.  For
example, a saw, a hammer, and a chisel form a set of tools to be used
by an expert carpenter, but they do not in themselves replace the
carpenter's expertise.  Exactly how the tools are used will depend on
the expertise of the user within the domain as well as on the user's
expertise with the tools.  Again, the analogy can be drawn between the
expert carpenter's use of tools vis-a-vis an apprentice's use.  The
"set of tools" approach to handling knowledge representation and
inference on specialized problems on a workstation lends itself well
to incorporating a broad set of tools, the collection of which is very
flexible, with interactions not limited to the vision of an expert
system designer.
     The challenge for systematics seems to be to determine how to
best exploit the new technology within the constraints of the
resources available.  In particular, how do we go about coordinating a
diverse set of tasks on the workstation and making a rich scientific
computing environment available and productive to scientists who vary
widely both in their knowledge of the domain and in their interest in
modern workstations?  Also some disciplines, such as
computer aided design (CAD), have used computer tools for many years,
and the existence of a large set of disparate tools that do not work
well together is a particularly vexing problem.  It is an advantage
for systematics that we are not saddled with the problem of resolving
many different existing standards, but the experience in areas such as
CAD indicates that it is of critical importance for other areas to pay
careful attention to tool coordination and standards from the outset.
We consider it important to address the question of which tools need
to be developed that provide support for fundamental activities in
systematics research, have reasonably wide appeal, have long
lifetimes, and form the basis for future developments.
     During the workshop we will have an early panel on "biological
tool frameworks".  During this panel and continuing into other
sessions we will ask participants about the systematists' requirements
-- what do you want and need to be able to do at a workstation, what
is a tool and what characteristics of tools emanate from the potential
uses, how should tools work together, and what are reasonable
standards for linking and managing possibly disparate tools?
                        James B. Woolley
                      Texas A&M University
                    Department of Entomology
                   College Station Texas 77843
     Biologists use the terms identification and classification
somewhat differently than computer scientists.  Classification to
biologists is the process of constructing taxonomies (or the ordering
of organisms into groups based on their relationships), and
identification is the process of assigning an unknown specimen a place
in an existing classification.  Given this distinction, the general
areas of identification and diagnosis are certainly familiar to
workers in artificial intelligence.  Expert systems for diagnosis of
diseases, for example, were among the first applications of AI and
this remains an active field of research and commercial development.
However, expert system technology has been little used by biologists
for identification of specimens.  This may be surprising to AI
workers, since at first glance, the problem of identifying biological
specimens might seem to be little different from the diagnosis of any
other classes of things. Certainly, many of the same difficulties are
encountered in identifying biological material, for example,
missing, imprecise or ambiguous data, scarcity of special expertise,
and so forth.
     However, there are some subtle differences between the
identification of biological specimens and the identification of other
classes of things. With other types of objects, taxonomies can be
erected for particular purposes, and they are often constructed with
identification in mind. Biological taxonomies are generally based on
criteria that may be quite external to identification processes;
commonly, they are based on perceived relationships between taxa
(groups of organisms).  Identifying an unknown specimen involves
determining to some level its placement in such a classification.
Characters or attributes of organisms that are useful in inferring
relationships may not be very suitable for the purposes of
identification, and often characters are used for the special purpose
of identification that are known to be unreliable indicators of
relationship. Biological classifications are rigidly hierarchical and
non-overlapping (that is, at a given level, an organism belongs to
only one taxon).  Existing tools for identification may or may not use
the structure and logic of biological classifications to advantage.
     The study of the relationships between organisms is the primary
research activity of systematic biologists, and classifications are
perhaps the primary product of this research.  People are often
surprised to learn that this task has not been completed.  Far from
it: in many plant and animal groups only a small proportion of the
species in nature have been formally described and classified.  For
example, about 750,000 insect species have already been described, but
estimates of the number of undescribed species range from another
million or so up to 30 million. Obviously, with this many objects the
methods used to organize information are critical to the ability to
store and retrieve data.
     Although various approaches exist for classifying organisms,
classifications based on evolutionary relationships (phylogenetic
history) are generally preferred because they are more informative and
robust.  The development of explicit methods for the inference of
phylogenetic relationships given various kinds of data is an extremely
active area of biological research, with wide implications for other
fields of biology.  The point is that classifications of organisms
provide our only means of organizing the immense amounts of
information about organisms, and that the identification of specimens
is the critical first step in accessing this information.  In many
situations, for example interception of potential pests at border
stations, precise identifications are critically important.
     At present, biological identifications are performed by a very
small number of people, many of whom also have research and teaching
interests.  Because identifications per se are often among the less
interesting of one's potential activities, there is widespread
interest in developing more efficient methods.  There has been some
implementation of expert system tools for biological identification,
and examples of these will be presented at the Workshop.  It will be
of interest to biologists to see examples of successful expert systems
now used for identification or diagnosis in other fields.  Certainly,
workers experienced in artificial intelligence can provide guidance on
the types of problems that are suited (and perhaps more importantly,
the problems that are not suited) to expert system techniques.
     Specifically, the following areas are clearly relevant, and would
seem to be a common starting point for discussions between
systematists and AI workers.
1- An exploration of the methods now available for representation of
biological knowledge domains, specifically biological classifications
and supporting information, will serve as a foundation for much of the
workshop.  In this particular context, the ability to incorporate the
structure and logic of biological classifications into identification
devices should provide means to make them more powerful and robust.
Systematists will no doubt find various methods for representing
structure in a knowledge domain interesting (frames, semantic
networks, etc.), and AI researchers will probably find that these
knowledge domains have interesting and perhaps unique properties.
2- Methods for dealing with uncertainty in the identification process
are clearly of interest.  There are several sources of uncertainty in
this context: damaged or incomplete specimens, natural variability
among individuals of a species (or other taxon), user uncertainty as
to the interpretation of attribute states, etc.  We are aware that
fundamentally different methods exist for representing uncertainty in
AI (probabilistic methods, decision theory, fuzzy set theory and so
forth) and we would like to see their potential in this context.
3- There will be concern about the practicality of implementing AI
methods in systematics.  Because many systematists now use database
techniques of some sort (although not always computer-based), an
exploration of techniques for rule induction would be useful.  Because
biological classifications are dynamic and research in many of these
taxa is ongoing, methods for revision and update of knowledge domains
are of interest.  Critical comparisons of commercially available
shells for expert systems and related issues (operating systems,
hardware) would be useful.
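As an illustration of item 2, the sketch below applies one of the
probabilistic methods mentioned there (a naive Bayes calculation) to
an identification in which one character cannot be scored because the
specimen is damaged.  The taxa, characters, and all probabilities are
invented, and the independence of characters is an assumption of the
sketch rather than a claim about real organisms.

```python
# Hypothetical P(character present | taxon); every number here is invented.
LIKELIHOODS = {
    "taxon_a": {"wing_spot": 0.9, "long_antenna": 0.2},
    "taxon_b": {"wing_spot": 0.3, "long_antenna": 0.8},
}
PRIORS = {"taxon_a": 0.5, "taxon_b": 0.5}


def posterior(observations, likelihoods=LIKELIHOODS, priors=PRIORS):
    """Return P(taxon | observations), treating characters as independent.

    observations maps a character name to True (present), False (absent),
    or None (not observable, for example because the structure is damaged).
    """
    scores = {}
    for taxon, prior in priors.items():
        score = prior
        for character, state in observations.items():
            if state is None:
                continue  # a missing character contributes no evidence
            p = likelihoods[taxon][character]
            score *= p if state else 1.0 - p
        scores[taxon] = score
    total = sum(scores.values())
    return {taxon: score / total for taxon, score in scores.items()}


damaged = posterior({"wing_spot": True, "long_antenna": None})
complete = posterior({"wing_spot": True, "long_antenna": False})
print(damaged)
print(complete)
```

With the antenna missing the identification rests on the wing spot
alone; scoring the antenna as well sharpens the result.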
                     PHYLOGENETIC INFERENCE
                        George V. Lauder
                  School of Biological Sciences
                University of California, Irvine
                     BITNET: GLauder at UCIvmsa
     Two of the key interests of evolutionary biologists are (1)
reconstructing evolutionary pathways or sequences of change in
particular features of organisms, and (2) reconstructing genealogical
relationships among organisms (also called phylogenies or evolutionary
trees).
     As an example of the first interest, an evolutionary biologist
might wish to understand the origin of flight in birds.  What was the
historical sequence of modifications in the muscles and the skeleton
that occurred to allow early birds to fly?  One might be able to make
several a priori predictions about necessary morphological changes for
flight (such as lightening of bones, reorientation of muscles to make
the upstroke and downstroke of the wings possible, and lengthening of
the arm bones to increase surface area).  But how exactly did the
sequence of modifications occur in evolution to produce early birds
with flight?  Did skeletal lightening occur before arm elongation, or
was muscle reorientation the first change that occurred?

     An example of the second interest would be an evolutionary
biologist who simply wanted to reconstruct the genealogical
relationships among 20 species of birds.  How is each species
genealogically related to each other species?  In other words, the
biologist is interested in reconstructing the tree that describes the
genetic relationships among the species.
     Both areas might benefit greatly from input from AI specialists.
Most of the data sets gathered by workers in evolutionary biology
generate either large numbers of trees or produce ambiguous
reconstructions of evolutionary history.  The number of trees may be
so great or the ambiguity so extensive that assistance is needed in
summarizing significant features of the trees and key aspects of
character evolution.
A  Reconstructing historical pathways of change in individual characters.  
     Perhaps it would be most useful if the concrete (albeit
hypothetical) example of bird flight is used to examine the potential
application of artificial intelligence to systematic and evolutionary
biology and the difficulties now faced by systematic biologists in
trying to interpret the evolution of characters.  If we are given a
tree that represents the genealogical relationships of a group of four
bird species and wish to understand how the evolution of particular
characters has occurred, we may want to reconstruct the evolution of
those characters on the tree.
     Typically, such analyses begin with a "taxon-by-character" data matrix:
                              Morphological feature
                              1     2     3       4
     Taxon:   species  1      A'    B     C'      D
              species  2      A     B'    C       D'
              species  3      A     B'    C       D
              species  4      A'    B     C'      D'
     where there are four species of birds, each one a row of the
data matrix, and four morphological features (1 to 4) each indicated
by a different letter.  Feature 1 could be the length of the arm
bones, and a short arm could be represented by the letter A and a long
arm by the letter A'.  Feature 2 could represent the weight of the
skeleton with B indicating a light skeleton and B' a heavy skeleton.
These characters would be determined by an examination of each species
and weighing and measuring of the muscles and bones.  How might we
reconstruct the evolution of these features in birds given this
distribution of species and morphological features?
If we have available a phylogenetic tree that indicates genealogical
relationships as follows:
                     species     4       3      2      1
                                 |       |      |      |
                                 |       |      |      |
                                 |       |      --------
                                 |       |          | (Z)
                                 |       |          |
                                 |       ------------
                                 |             | (Y)
                                 |             |
                                 ---------------
                                        | (X)
(note that time runs up the page, and that nodes are indicated by the
letters X, Y, and Z) and we accept this tree for the moment as a true
depiction of the genealogical relationships of the four species of
birds, then it is possible to reconstruct the evolution of any
specific character on this tree.  Unfortunately, it can easily be seen
that there are two ways to reconstruct the evolution of morphological
feature 1. Bird species 1 and 4 share feature A' while species 2 and 3
share feature A.  Nodes Y and Z could be reconstructed as having A'
and a total of 2 evolutionary steps (A' to A) would occur in species 2
and 3, or one could reconstruct nodes Y and Z as having feature A and
2 evolutionary steps would be required (A' to A from node X to Y, and
A to A' from node Z to species 1).  The key dilemma is that it is
possible to reconstruct the evolution of this character in two very
different ways both of which require the same number of steps.  This
is of course true for each character that is not completely consistent
with the given tree.  When forty or fifty morphological features are
used in a study it is clear that any attempt to understand the
evolution of these characters is made extremely difficult by our
inability to examine the patterns of variation in reconstructed
character evolution and exactly how each character has changed.
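The ambiguity described above can be checked by brute force.  The
sketch below tries every assignment of A or A' to nodes X, Y, and Z on
the tree given above and counts the steps each one requires for
feature 1.  The names EDGES, TIPS, and most_parsimonious are invented
for the illustration; exhaustive search is practical only for very
small trees.

```python
from itertools import product

# Tree from the text: X joins species 4 and node Y; Y joins species 3 and
# node Z; Z joins species 2 and species 1.  Each pair is (parent, child).
EDGES = [
    ("X", "sp4"), ("X", "Y"),
    ("Y", "sp3"), ("Y", "Z"),
    ("Z", "sp2"), ("Z", "sp1"),
]

# Observed states of feature 1, copied from the data matrix.
TIPS = {"sp1": "A'", "sp2": "A", "sp3": "A", "sp4": "A'"}
INTERNAL = ["X", "Y", "Z"]
STATES = ["A", "A'"]


def count_steps(assignment):
    """Count the branches whose two ends carry different states."""
    states = {**TIPS, **assignment}
    return sum(states[parent] != states[child] for parent, child in EDGES)


def most_parsimonious():
    """Score every assignment of states to X, Y, Z; keep the shortest."""
    scored = []
    for combo in product(STATES, repeat=len(INTERNAL)):
        assignment = dict(zip(INTERNAL, combo))
        scored.append((count_steps(assignment), assignment))
    best = min(steps for steps, _ in scored)
    return best, [a for steps, a in scored if steps == best]


best, reconstructions = most_parsimonious()
for reconstruction in reconstructions:
    print(best, reconstruction)
```

The search recovers both two-step reconstructions described above.  If
the state at the root X is left unconstrained, as it is here, it also
returns a third two-step reconstruction in which X, Y, and Z all carry
A and the long arm arises separately in species 1 and 4.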
     For a biologist interested in the evolution of a structure such
as the bird wing, this ambiguity in reconstruction and our inability
to conveniently summarize divergent results makes interpreting the
history of morphological changes extremely difficult.  It would be of
considerable interest to know, for example, if the ancestor of each of
the four bird species above possessed the character A', a long arm, or
if the process of morphological evolution is reconstructed as having
involved an ancestor with a long arm that became shorter in the
ancestors of species 2 and 3 and then became long again in species 1.
     It seems likely that artificial intelligence techniques could be
used to summarize the possible reconstructions of characters on a
given tree and present a summary of the major patterns of variation.
Ideally, information on the nature of the characteristics of the
species could be added so that changes in arm bone characters could be
evaluated independently of changes in skull characters.  The central
problem in character reconstruction is that too many possibilities
exist to allow an easy understanding or visualization of the major
patterns of character evolution.
     Evolutionary biologists need approaches and techniques that will
permit visualization and an overall perspective on the transformation
of characters on trees.
B   Reconstructing genealogical relationships
     The above discussion has assumed that a particular phylogenetic
tree is given to work with.  But a similar general problem, and one
that could greatly benefit from contributions of artificial
intelligence, occurs when we attempt to reconstruct such trees in the
first place.  The most commonly accepted criterion for choosing a tree
is that the tree should involve the smallest number of evolutionary
steps (i.e., select the tree that has the shortest total length).  But
a given taxon-by-character data matrix may produce many (even
hundreds of) equally short trees.  Any contribution that artificial
intelligence techniques could make to summarizing this variation in
tree topologies would be of great help to evolutionary biologists in
their attempts to reconstruct genealogical patterns among organisms.
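One simple way to summarize a set of competing trees is a strict
consensus, which keeps only the groupings found in every tree.  The
sketch below is a minimal illustration of that idea and not a method
proposed in the text.  The three five-species trees are hypothetical
(they were not derived from the data matrix above), and each tree is
written simply as the set of clades it contains.

```python
def clades(*groups):
    """Write a tree as the set of clades (species groups) it contains."""
    return {frozenset(group) for group in groups}


# Three hypothetical rooted trees for species 1 to 5.
TREES = [
    clades({1, 2}, {1, 2, 3}, {4, 5}, {1, 2, 3, 4, 5}),     # ((1,2),3),(4,5)
    clades({1, 2}, {1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3, 4, 5}),  # (((1,2),3),4),5
    clades({2, 3}, {1, 2, 3}, {4, 5}, {1, 2, 3, 4, 5}),     # ((2,3),1),(4,5)
]


def strict_consensus(trees):
    """Keep only the clades that appear in every tree."""
    shared = set(trees[0])
    for tree in trees[1:]:
        shared &= tree
    return shared


def clade_support(trees):
    """Fraction of the trees in which each clade appears."""
    counts = {}
    for tree in trees:
        for clade in tree:
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: n / len(trees) for clade, n in counts.items()}


consensus = strict_consensus(TREES)
support = clade_support(TREES)
for clade in sorted(support, key=lambda c: (len(c), sorted(c))):
    print(sorted(clade), round(support[clade], 2))
```

Here only the grouping of species 1, 2, and 3 (and the trivial group
of all five species) survives in the strict consensus; the support
values show how often each of the remaining groupings occurs.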
                        Julian Humphries
                       Cornell University
                        Ithaca, NY 14850
     There are two general areas where we anticipate that artificial
intelligence techniques will be useful in this arena.
     There exists in the natural history museums an enormous reservoir
of information about the distributions of organisms.  Much of this
information is being transferred to computerized databases,
potentially enhancing the process of understanding species
distributions.  Unfortunately, much (if not most) of these data are
accompanied not by precise descriptors of where the specimens were
collected but anecdotal descriptions of how the collector got to the
site (e.g., 12 air miles NNW of the intersection of State Hwy 12 and US
1).  Such locality depictions may actually refer to a very precise
place, yet without first hand knowledge of the area most researchers
will have to resort to maps to actually determine the location.  There
are literally millions of collections which under our current system
would need anywhere from 1-30 minutes each to determine a latitude and
longitude (or other standardized coordinate system).  Such a daunting
task means that it will only be attempted when an individual
researcher needs data for a particular taxon.
     It is hoped that sufficient rules about places on earth (i.e.,
cities, towns, counties, highways, geographic features, etc.) could be
tabulated to at least semi-automate this process.  The basic task
would be one of parsing the anecdotal descriptions, deciphering an
approximate or actual latitude and longitude from the data,
determining a level of reliability or resolution (e.g., localities
where the data consist solely of "New World" should translate into an
equally unresolved 'exact' description), perhaps showing a plot of the
locality to a human operator on a bit-mapped screen, and finally
storing the result in an extensible database.  Although this still
requires human intervention, the process should be orders of magnitude
faster and ultimately more accurate.
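The parsing step for the simplest kind of description (a distance and
compass bearing from a named place) might look like the sketch below.
Everything specific in it is an assumption made for illustration: the
gazetteer entry and its coordinates are invented, "airmiles" is read
as statute miles, and a flat-earth approximation is used for the
offset.  Descriptions that do not match are reported as failures
rather than guessed at.

```python
import math
import re

# Hypothetical gazetteer entry; the coordinates are invented.
GAZETTEER = {"intersection of state hwy 12 and us 1": (35.0, -80.0)}

COMPASS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
KM_PER_MILE = 1.609344   # statute miles assumed
KM_PER_DEGREE = 111.32   # approximate length of one degree of latitude

PATTERN = re.compile(
    r"(?P<dist>\d+(?:\.\d+)?)\s*(?:air\s*)?miles?\s+"
    r"(?P<dir>[NSEW]{1,3})\s+of\s+(?:the\s+)?(?P<place>.+)",
    re.IGNORECASE,
)


def georeference(description, gazetteer=GAZETTEER):
    """Return (lat, lon, resolution_km), or None if parsing fails."""
    match = PATTERN.search(description)
    if match is None:
        return None
    place = match.group("place").strip().rstrip(".").lower()
    direction = match.group("dir").upper()
    if place not in gazetteer or direction not in COMPASS:
        return None  # report failure rather than guess
    lat, lon = gazetteer[place]
    bearing = math.radians(COMPASS.index(direction) * 22.5)
    dist_km = float(match.group("dist")) * KM_PER_MILE
    # Flat-earth approximation, adequate only for short offsets.
    new_lat = lat + dist_km * math.cos(bearing) / KM_PER_DEGREE
    new_lon = lon + dist_km * math.sin(bearing) / (
        KM_PER_DEGREE * math.cos(math.radians(lat))
    )
    # A 16-point compass bearing is only good to about 11.25 degrees
    # either side, so the sideways uncertainty grows with distance.
    resolution_km = dist_km * math.sin(math.radians(11.25))
    return new_lat, new_lon, resolution_km


text = "12 airmiles NNW of the intersection of State Hwy 12 and US 1"
print(georeference(text))
print(georeference("New World"))
```

Returning a resolution estimate alongside the coordinates, and None
when the description cannot be interpreted, is one way to record how
far each translated locality can be trusted.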
        There are a number of confounding factors.  Our data have a
significant temporal component, having been accumulated over the
last 100 years.  As such, the underlying knowledge base will need to
know what "Germany" means in all its various historical contexts.
Because we are dealing with biological objects, this too will taint
the descriptions.  As an example, many collections will be precisely
located, but refer to a transect rather than a point (e.g. an oceanic
trawl).  The need for a measure of the resolution achieved cannot be
stressed too strongly.  Unless we have some indication of when the
process failed we will have little faith in the translation.
        The other related subject also concerns data extraction, but
in a more abstract context and certainly not restricted to
systematics.  The process of accumulating knowledge is complicated,
but at some level we acquire pieces of information from some larger
construct.  As the scientific knowledge base expands we are forced to
spend larger amounts of time in the process of simply acquiring
information (prior to any processing).  Most of us have at times
wished someone (or something) could scan our journals for us and tell
us which articles we need to read.
        It seems to me that I could, with time, build up a framework
describing my preferences, needs, priorities and other individual
aspects of my research program.  Such a framework could be used as a set
of rules to guide an "AIde" program which would glean the "important"
information from the original literature.  Note that I am begging the
question of how these data will be represented.  I presume that in
coming years a greater proportion of our source material will be in
some machine readable form; we need to have the tools ready to take
advantage of that situation.
                         F. James Rohlf
               Department of Ecology and Evolution
                  State University of New York
                  Stony Brook, NY 11794-5245  
     Most data presently used in systematics are collected through the
visual examination of specimens.  Features are usually found by the
visual comparison of specimens and most measurements are taken
visually.  These activities can be quite time consuming. Thus there is
the potential for saving a systematist's time if appropriate hardware
and software were available that would enable routine measurements to
be made automatically.  This would permit more extensive large-scale
quantitative studies.
     But automation is difficult in systematics since the features to
be measured are usually not easily separated from the background,
i.e., the visual scene is often cluttered, and the structures of
interest may not have distinct colors or intensities as in many
industrial applications of image analysis. The problem is especially
difficult for certain groups of organisms.  The problem is further
complicated due to biological variability.  One usually cannot depend
upon homologous structures having consistent geometrical features that
can be used to automatically identify landmarks of interest.  Other
important complications are that most structures of interest are
3-dimensional and that the "texture" of surfaces often contains
taxonomically useful information.  Both aspects are difficult to
capture with presently available hardware and software.
     For these reasons present applications of image analysis in
systematics have been quite modest.  In studies where data are
recorded automatically, time is spent simplifying the image.  For
example, structures of interest are physically separated from the rest
of the specimen and placed upon a contrasting plain background so the
outline can be found with little error.  Alternatively, an
investigator can identify structures of interest by pointing to them
with a mouse, watching how a program finds an outline, and then
editing the trace if necessary.  Working from this outline, additional
landmarks can be identified by the operator.  In some cases these
landmarks can be associated with geometrical features of the outline
and it will be possible for the software to help the operator to
accurately locate these points.  Due to the difficulty of solving the
general problems of the automatic analysis of complex biological
scenes, a more immediate goal should be to develop powerful tools that
a systematist can interact with to isolate structures, locate
landmarks, and compute various measurements.  In addition, it would be
desirable for the software to "learn" how to recognize the structures
so that the process will go faster as both the software and the
systematist become more experienced.
   Once the structures and landmarks have been found they are usually
recorded so that, if necessary, additional measurements can be made
without having to go back to the original image. These are usually in
the form of x,y-coordinates of landmarks or chain-coded outlines.  For
very large studies, methods to compress this raw descriptive
information need to be used.
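A chain-coded outline of the kind mentioned above can be sketched as
follows.  This assumes the common 8-direction (Freeman) scheme, in
which each step from one outline pixel to the next is stored as a
single code from 0 to 7 (three bits) instead of a full coordinate
pair.  The direction numbering and the sample outline are choices made
for the illustration.

```python
# Freeman 8-direction codes: 0 is +x, then counterclockwise in 45-degree
# steps.  Assumes y increases upward; flip the y signs for image rows.
STEPS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
CODE_OF = {step: code for code, step in enumerate(STEPS)}


def encode(points):
    """Turn a path of 8-connected pixel coordinates into (start, codes)."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Raises KeyError if two successive points are not neighbours.
        codes.append(CODE_OF[(x1 - x0, y1 - y0)])
    return points[0], codes


def decode(start, codes):
    """Rebuild the coordinate path from the start point and chain codes."""
    path = [start]
    for code in codes:
        dx, dy = STEPS[code]
        x, y = path[-1]
        path.append((x + dx, y + dy))
    return path


outline = [(0, 0), (1, 0), (2, 1), (2, 2), (1, 2), (0, 1), (0, 0)]
start, codes = encode(outline)
print(start, codes)
```

Because decode rebuilds the original coordinates exactly, the code
sequence can be stored in place of the outline without losing the
ability to make further measurements later.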
   The features that are measured are usually the same types of
features that would have been measured by hand -- 2-dimensional
distances between landmarks or angles between pairs of landmarks. In
some studies the features used are parameters from functions (such as
Fourier, cubic splines, Bezier curves) fitted to the shapes of
structures or of entire outlines of organisms.  More work is needed to
develop new types of features and to evaluate the implications of
their use relative to traditional methods.
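The traditional measurements mentioned above are straightforward to
compute once landmark coordinates have been stored.  In the sketch
below the landmark names and coordinates are invented, and an angle is
taken, as one possible reading, to be the angle at one landmark
between the directions to two others.

```python
import math


def distance(p, q):
    """Euclidean distance between two landmarks given as (x, y)."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def angle_at(vertex, p, q):
    """Angle in degrees at `vertex` between the rays toward p and q."""
    a1 = math.atan2(p[1] - vertex[1], p[0] - vertex[0])
    a2 = math.atan2(q[1] - vertex[1], q[0] - vertex[0])
    diff = abs(a1 - a2) % (2 * math.pi)
    # Report the smaller of the two angles between the rays.
    return math.degrees(min(diff, 2 * math.pi - diff))


# Hypothetical landmarks on a digitized structure, in arbitrary units.
LANDMARKS = {"tip": (4.0, 0.0), "base": (0.0, 0.0), "notch": (0.0, 3.0)}

length = distance(LANDMARKS["tip"], LANDMARKS["notch"])
angle = angle_at(LANDMARKS["base"], LANDMARKS["tip"], LANDMARKS["notch"])
print(length, angle)
```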
