A Wish-List for Gene Identification Programs

Eric E. Snyder eesnyder at boulder.colorado.edu
Fri Aug 28 15:46:11 EST 1992

I have been working on a program that identifies coding sequences
from genomic DNA.  I have also been studying the performance of 
a number of the more recent programs/papers on the topic.  I have 
been struck by the singular lack of consistency in the way the 
performance of these systems is evaluated.  While some of this 
inconsistency is due to differences in the limitations or intent 
of the different programs, I think it would be helpful to everyone
to develop a set of standard performance measures by which to
judge the success of each new method.  

I would like to see the following data:

Results in terms of complete sequences:

  Exons (in)correctly predicted (both donor and acceptor sites correct)
  Exons partially predicted (one donor or acceptor site correct)
  Exons partially predicted (prediction overlaps actual exon but 
                             boundaries incorrect)

  Number of exons (of all types) for which the reading frame is (in)correctly 
  predicted
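
As a rough illustration of how the exon-level categories above might be
tallied, here is a minimal sketch.  The interval convention (0-based start,
exclusive end), the category labels, and the function names are my own
assumptions for the example and are not taken from any of the programs
discussed.  Reading frame is not handled here.

```python
from collections import Counter

# Higher rank = better match; used to keep the best label per actual exon.
RANK = {"missed": 0, "overlap": 1, "one_boundary": 2, "correct": 3}

def classify_exon(actual, predictions):
    """Classify one actual exon (start, end) against all predicted exons.

    Coordinates are 0-based with an exclusive end (an assumed convention).
    Returns 'correct', 'one_boundary', 'overlap', or 'missed'.
    """
    a_start, a_end = actual
    best = "missed"
    for p_start, p_end in predictions:
        if p_start >= a_end or p_end <= a_start:
            continue  # this prediction does not overlap the actual exon
        if (p_start, p_end) == (a_start, a_end):
            label = "correct"        # both boundaries agree
        elif p_start == a_start or p_end == a_end:
            label = "one_boundary"   # exactly one boundary agrees
        else:
            label = "overlap"        # overlaps, neither boundary agrees
        if RANK[label] > RANK[best]:
            best = label
    return best

def exon_summary(actual_exons, predicted_exons):
    """Count actual exons per category, plus predictions hitting no exon."""
    counts = Counter(classify_exon(a, predicted_exons) for a in actual_exons)
    counts["false_prediction"] = sum(
        1 for p in predicted_exons
        if classify_exon(p, actual_exons) == "missed"
    )
    return counts

actual = [(10, 20), (30, 40), (50, 60), (70, 80)]
predicted = [(10, 20), (30, 45), (52, 58), (90, 95)]
summary = exon_summary(actual, predicted)
print(dict(summary))
```

In this toy case each of the four actual exons falls into a different
category, and the prediction at (90, 95) counts as a false positive.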

Results in terms of nucleotides:

  Nucleotides (in)correctly predicted as exon  
  Correlation coefficient for exon prediction
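
To make the nucleotide-level measures concrete, here is a minimal sketch.
It assumes the correlation coefficient meant is the ordinary Pearson
correlation between the actual and predicted exon/non-exon labels, which
for two binary variables reduces to the phi (Matthews) coefficient computed
from the four raw counts.  The function names and the 0-based,
exclusive-end coordinates are assumptions made for the example.

```python
import math

def nucleotide_counts(actual_exons, predicted_exons, seq_len):
    """Return (tp, fp, fn, tn) counted over every position of the sequence.

    Exons are (start, end) pairs, 0-based with an exclusive end
    (an assumed convention).
    """
    actual = [False] * seq_len
    predicted = [False] * seq_len
    for start, end in actual_exons:
        for i in range(start, end):
            actual[i] = True
    for start, end in predicted_exons:
        for i in range(start, end):
            predicted[i] = True
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if p and not a)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = seq_len - tp - fp - fn
    return tp, fp, fn, tn

def correlation_coefficient(tp, fp, fn, tn):
    """Phi coefficient from the 2x2 table; None when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return None  # e.g. nothing predicted, or nothing actually coding
    return (tp * tn - fp * fn) / denom

# One actual exon at 10..19, one prediction at 15..24, 40-nt sequence.
tp, fp, fn, tn = nucleotide_counts([(10, 20)], [(15, 25)], 40)
print(tp, fp, fn, tn, correlation_coefficient(tp, fp, fn, tn))
```

The point of the sketch is that the coefficient can only be computed if all
four counts are reported, including the true negatives.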

Results in terms of splice sites:

  Correlation coefficient for splice site predictions 

Results in terms of assembled genes:

  Number of amino acids in predicted protein (in)correctly predicted

I find that most papers on the subject analyze their data in terms 
of only one or two of these categories.  Often, it is difficult to get a 
feeling for rates of false positives, and it is never possible to calculate
the correlation coefficients mentioned above from the raw data.  While
I appreciate that one must work within space limitations, there are so many
programs of this type appearing in the literature that it is important to 
be able to attempt some objective comparison.  

So, since I too am working on this problem, I would like to compile 
a wish-list of statistics that bionet readers would like to see.  
What critical bit of information have you found lacking in the analysis
of GeneID, GeneModeler, GRAIL, NETGENE, SORFIND, or whatever program
you have been using?  I will compile a list of such stats and post it if 
there is sufficient interest.


Eric E. Snyder                            
Department of MCD Biology              ...making feet for childrens' shoes.
University of Colorado, Boulder   
Boulder, Colorado 80309-0347
