Using computers to study mushrooms (2 of 2)

Nathan J. Wilson nathan at cse.ucsc.edu
Fri Sep 3 15:35:41 EST 1993


This is the second of two documents related to my efforts to create 
computer programs that will be helpful for studying mycology.  The one
below is a report written for an independent study I did last year in
graduate school.  It is aimed at knowledgable computer scientists with 
little or no knowledge of mushrooms.

-- Begin report --

Final Project Report
Independent Study - Winter 1993
Developing a Computer System for Taxonomic Identification

Nathan Wilson
March 25, 1993


* Goals *

The purpose of this project is to develop a general purpose taxinomic
database using fungi as a case study.  The general concept behind
the system is for the user to be able to specify values for a set of
features and have the system search through a database of species
descriptions for matches to the specified set.  The effort for the
project had four principle components: developing the database
engine, designing and creating the user interface, creating a small
trial database, and an attempt to find an efficient search method.
In addition, the system includes an advisory facility for helping the 
user to decide which feature to enter values for next.


* The System *

The system as it currently stands includes a database and a program.
The program, in turn, is conceptually divided into two parts: the
database engine which is responsible for executing any requested
transactions on the database, and the user interface which gets input
from the user, passes these requests on to the database
engine, and displays result returned by the database.  The database
engine communicates with the user interface through a small command
language.  The goal of creating this distinction is to aid in future
attempts to port the system to other platforms.  At the moment, the
system runs only on Apple Macintoshes, though I expect to create a
command line interface for it in the near future.

 + Database Engine +

The database consists of a set of 'descriptions', one for each
species.  A description consists of a set of 'features'.
The features each have a unique 'field' name by which they can be
accessed, and a set of 'values'.  Currently the values are a
subset of a set of names associated with each feature.  There are two
special fields `Genus' and `species'.  Each of these fields have a
specially marked value, known as the 'canonical' value.  The
combination of the canonical value for the `Genus' and `species'
fields uniquely identify a particular description.

The database engine includes facilities for reading, writing and merging
databases; adding new descriptions, features or values; and searching the
database.

 + The User Interface +

The user interface for the Macintosh includes two windows, the
'selection' window, and the 'feature' window.  The selection window
includes a scrolling list of the canonical genera.  One of these can
be selected from the list either with the mouse or by typing.  Name
completion and keyboard scrolling are supported.  After a genus has
been selected a list of the canonical species names which have that
genus as their canonical genus appears in another list.  Selecting a
species causes that description to be added to a special sub-database
called the 'selection'.  Most of the transactions the user
performs on the database are with respect to the selection.  For
example, the advisory system only looks at the members of the
selection when deciding which features to suggest.

The user typically interacts with the system by constructing their
own, typically incomplete, description known as the 'key' in the
feature window.  The process of constructing the key is similar to the
species selection process.  The key can be inserted directly into the
database as a new description, or it can be used to manipulate or extend
the selection.  The descriptions in the selection can be changed by
either adding or removing values for each feature in the key.  If
a description already has a value for a given feature, then a set
union or set difference operation is performed.

 + Searching +

The search facilities are also controlled with the key.  Descriptions
that match the key can be either added or removed from the selection.
In addition, all descriptions that do not match the key can be removed 
from the selection.  For completeness it might be good to add an option 
to add all descriptions from the database that do not match the key, but I
have yet to find a need for this option.

The user can choose between two matching policies.  The 'or' policy is
that a description and the key are considered to match if the intersection
of the values for each of the features in the key is non-empty.  The
'and' policy is that a match only occurs if for each feature the
values in the key are a subset of the values in the description.  In
my own use of the system I have used the 'or' policy almost
exclusively.

 + Most Specific Generalizers and Concepts +

Another search related feature is that the key can be modified by 
selecting a particular species from the selection window and finding the 
most specific generalizer between that description and the key.  If the 
key is empty then the key becomes the entire description.  There is also a
facility that attempts to construct an entire partial order of the possible
generaliers for the entire set of descriptions.

 + The Advisory System +

The advisory system works by sorting the features on the basis of the
selection.  The user can choose between three different sorting methods.
The simplest is the 'Count' method.  For each feature the number
of values used in the selection is counted.  A feature is considered to
be better if it has more values.

The other two keys are based on not only the number of possible
values, but also the number of times each value occurs over the entire
selection.  In the 'MinMax' method, the feature whose most common
value occurs the least often in the selection is considered the best.  
Finally, the 'LogSum' method uses the sum of the logs of the value counts
to sort the feature, with the highest sum being the best.

 + The Database +

The database created this quarter contains descriptions for 50 species
of fungi.  All descriptions were based on published descriptions of
the species ([Arora, 1986], [Jenkins, 1986], and [Thiers et al.]).
Each species description has 58 features whose values are each a
subset of the values possible for that feature.  The choice of
features and values was based on [Largent, 1986].  There are between 2
and 50 values for each feature, and 333 values across all the
features.  Most features had less than 10 possible values.


* Searching for Efficient Search *

(illustration showing number of concepts as species are added)
        
The focus of this part of the effort was to try to use the partial
order of the most specific generalizers across all the subsets of the
species.  The hope was that this partial order would be
sparse due to the latent structure created by the biological
relatedness of species.  The figure shows that some
savings was in fact found over the worst case exponential growth that
would be expected with a randomly selected set of descriptions.
However, the savings is not sufficient to consider using this partial
order directly to help structure the search or retrieval tasks.
It may be that the structure that is there can be exploited
by restricting the partial order to only those subsets that have
a sizable membership.


* Evaluating Sorting Methods for the Advisory System *

The principle result of this aspect of the work is a comparison
between the sorting methods.  The MinMax sorting method is judged to
be better, though no standard measure was used to directly confirm
this judgement.

 Count                        MinMax                     LogSum
------------------------------------------------------------------------------
Habitat                    | Spore surface color      | Habitat
Stem surface texture       | Gill distance            | Months
Cap surface color          | Spore color              | Cap surface color
Stem surface texture below | Spore surface attachment | Spore surface color
Stem surface texture above | Cap surface color        | Cap shape
Spore surface color        | KOH reaction             | Cap cross section
Months                     | Cap shape                


More information about the Mycology mailing list