Benchmarks for Gene Identification Programs

Eric E. Snyder eesnyder at boulder.colorado.edu
Wed Sep 9 11:37:48 EST 1992


A couple of weeks ago, I posted a 'Wish-List' for gene identification 
programs.  This wish-list contained a number of statistics I would like
to see for coding sequence identification programs like GeneID and GRAIL. 
The purpose of having data of this type is two-fold: it would help those
involved in the development of such programs (such as myself) gauge
the performance of their programs, and it would assist researchers seeking
to find the program best suited to their particular problem.

I have received a number of very informative responses to the problem
I posed.  Beyond the need for a standardized battery of performance
statistics, choosing an appropriate set of test sequences is equally
important.  All programs I have seen have been parameterized to perform
best on a specific type of sequence.  GRAIL, for example, is tailored to
identifying exons in human genes and performs less well on non-human
sequences.   In contrast, GeneID is geared to parsing genes from
vertebrates in general, at the expense of reduced performance on human
sequences.  While these differences in intent are clearly documented,
they may not be clear to the casual user.  Furthermore, one would like 
to have a quantitative measure of performance on a number of different
input data types, regardless of the original objective of the program. 

Steen Knudsen and Roderic Guigo have supplied me with the data from 
their comparisons of GeneID and GRAIL.  These test sets are independent
of those used to train the programs and differ enough to get a feel
for how these programs perform on different types of data.  The results
obtained on these datasets also fulfill most of the requirements of 
my original 'wish-list'.   As suggested by Steen Knudsen, I hope these
benchmark datasets and statistics can serve as a model for others 
working on this problem and also provide 'consumers' with useful
information about the utility of these programs in their research.  

Finally, I would encourage others working in this field to use these
datasets to test their own programs and make suggestions for better
test sets.   If researchers will send me their data,  I will maintain a 
list of the results and standard test sets and periodically post 
this information to the net.  

Thanks again to Steen Knudsen and Roderic Guigo for sharing their data
and for providing a starting point from which we can begin to structure 
the research in this field.  




From: rgs at temin.Lanl.GOV (Roderic Guigo)

DATA.
The sets of genes described in PNAS (GRAIL) and JMB (GeneId) as test sets
were considered.
From JMB, the independent data set of 28 genes was used (GI-set)
instead of the larger one of 169 genes, since the latter
was partially used during the development of GeneId.
From PNAS, the set of 19 genes was used (GR-set).
One gene (HUMTPA) was removed from the GR-set because it exceeds the maximum
length allowed by the current version of GeneId,
and one gene (HAMRPS14A) was removed from the GI-set
because I was unable to retrieve the sequence from GenBank, rel. 72.

In summary, two sets of genes were considered:
GR-set: 18 human genes
GI-set: 27 vertebrate genes

for which the **entire** sequence in the GenBank rel. 72 entry was obtained.


METHOD

1. Both sets were submitted to both GRAIL and GeneId e-mail servers.

	[
	objection 1: GRAIL was specifically developed to analyze human genes,
	while GeneId was intended more generally for vertebrate genes. The
	GI-set contains some non-human genes.

	objection 2: GeneId was primarily designed to analyze transcription units,
	that is, DNA sequences corresponding exactly to the transcribed mRNA.
	The entire sequence in a GenBank entry is not usually restricted to
	the transcription unit.
	]

2. Output from GRAIL and GeneId was parsed to obtain the GRAIL and GeneId
predictions for each gene. 
The top model from the GeneId predictions was considered the GeneId prediction
(GI-prediction).
The set of all 'forward' exons (excellent+good+marginal) in the GRAIL
predictions was considered the GRAIL prediction (GR-prediction).

In a sense, GI and GR predictions are not exactly comparable. GI-predictions
are assembled genes (linear combinations of exons satisfying a number of
rules), while GR-predictions are linear arrangements of non-overlapping
exons, which cannot necessarily, and usually cannot, be assembled into
continuous genes. However, in another sense they are comparable, since both
assign to each position on the input sequence a binary value:
coding/non-coding. This is the way in which they are evaluated in their
respective papers.

	[
	objection 1: Only excellent (or excellent+good) exons in the GRAIL 
	predictions could be considered to obtain the GR-predictions.

	objection 2: Predictions on the complementary strand could be taken into
	account. Since the GR-set and the GI-set do not contain genes on the
	complementary strand, this would mean 1) considering all GRAIL 'reverse'
	exons as false positives, and 2) running GeneId on the complementary
	strand and considering the whole resulting GeneId prediction a false
	positive.
	]
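
The per-position coding/non-coding comparison described in step 2 can be
sketched as follows. This is only an illustration, not the code actually
used: the function names and the exon coordinates are hypothetical, and
exons are assumed to be given as 1-based, inclusive (start, end) pairs on
the forward strand.

```python
def coding_positions(exons):
    """Return the set of sequence positions covered by the given exons.

    Assumption: exons are 1-based, inclusive (start, end) pairs.
    """
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def count_overlap(true_exons, predicted_exons):
    """Return (true coding, predicted coding, shared) position counts."""
    true_pos = coding_positions(true_exons)
    pred_pos = coding_positions(predicted_exons)
    return len(true_pos), len(pred_pos), len(true_pos & pred_pos)


# Hypothetical coordinates, for illustration only.
true_exons = [(101, 200), (301, 450)]
predicted_exons = [(121, 210), (301, 400)]
print(count_overlap(true_exons, predicted_exons))  # prints (250, 190, 180)
```

Treating each prediction as a set of coding positions is what makes an
assembled gene and a plain list of exons comparable on the same footing.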

3.  For each gene and prediction, I computed DNA_length, true_CDS,
predicted_CDS and true_predicted_CDS (true positive, tp).
From these data I computed the proportion of true_CDS correctly
predicted (s1), the proportion of predicted_CDS that is actually true (s2),
and the corresponding correlation coefficient (cc).
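
As a rough sketch, the three statistics can be derived from the four
per-gene counts as shown here. The formula for cc is not spelled out in
the text; the sketch assumes it is the standard correlation coefficient
for a 2x2 table of per-position coding/non-coding calls, an assumption
that reproduces the HUMALPHA row of table A.1. The function name
cds_accuracy is hypothetical.

```python
import math


def cds_accuracy(dna_length, true_cds, predicted_cds, tp):
    """Return (s1, s2, cc) from per-gene nucleotide counts.

    Assumption: cc is the correlation coefficient of the 2x2 table of
    true versus predicted coding/non-coding positions.
    """
    fp = predicted_cds - tp          # predicted coding, actually non-coding
    fn = true_cds - tp               # true coding, not predicted
    tn = dna_length - tp - fp - fn   # non-coding and predicted non-coding
    s1 = tp / true_cds if true_cds else float("nan")
    s2 = tp / predicted_cds if predicted_cds else float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else float("nan")
    return s1, s2, cc


# HUMALPHA row of table A.1: DNA=4556, CDS=1599, pred=1010, TP=1000
s1, s2, cc = cds_accuracy(4556, 1599, 1010, 1000)
print(f"{s1:.2f} {s2:.2f} {cc:.2f}")  # prints 0.63 0.99 0.71
```

Unlike s1 and s2, cc also depends on DNA_length, because the number of
correctly rejected non-coding positions enters the formula.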

In addition,  I computed the **average** s1, s2 and cc for each separate data 
set, GR-set and GI-set, and for the combined data set GR+GI-set, as in the
JMB paper.
I also computed s1, s2 and cc for the **total** number of positions analyzed,
again for each separate data set, GI and GR, and for the combined set
GI+GR, as in the PNAS paper.
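
The difference between the **average** and the **total** statistics can be
sketched as follows: the average weights every gene equally, while the
total pools the counts first, so long genes weigh more. The function names
are hypothetical, cc is assumed to be the usual 2x2 correlation
coefficient, and the three rows are copied from table A.1 purely as sample
input; they are a subset, not the full GR-set.

```python
import math


def metrics(dna, cds, pred, tp):
    """Return (s1, s2, cc); cc assumed to be the 2x2 correlation coefficient."""
    fp, fn = pred - tp, cds - tp
    tn = dna - tp - fp - fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    s1 = tp / cds if cds else float("nan")
    s2 = tp / pred if pred else float("nan")
    cc = (tp * tn - fp * fn) / denom if denom else float("nan")
    return s1, s2, cc


# (locus, DNA, CDS, pred, TP): three rows from table A.1, as sample input
ROWS = [
    ("HUMALPHA", 4556, 1599, 1010, 1000),
    ("HUMSAA", 3460, 369, 82, 82),
    ("HUMRASH", 6453, 570, 605, 503),
]


def average_metrics(rows):
    """Per-gene statistics averaged over genes (each gene counts equally)."""
    per_gene = [metrics(*row[1:]) for row in rows]
    return tuple(sum(m[i] for m in per_gene) / len(per_gene) for i in range(3))


def total_metrics(rows):
    """Statistics computed once on the counts summed over all genes."""
    sums = [sum(row[i] for row in rows) for i in range(1, 5)]
    return metrics(*sums)


print("average:", average_metrics(ROWS))
print("total:  ", total_metrics(ROWS))
```

The two summaries generally differ whenever sequence lengths or
per-gene accuracy vary across the set.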

RESULTS

(Note that for both GeneId and GRAIL the results obtained on their own data
sets (GI-set and GR-set) differ from those published in their respective papers.
For GeneId, results are slightly worse because 1) a different version of
GeneId is currently being executed and 2) the entire GenBank entry,
and not only the transcription unit, is now considered.
For GRAIL, the results are slightly better because 1) a different version of
GRAIL is currently being executed and ?.)


EVALUATION OF GENEID AND GRAIL (entire GenBank entry) R.guigo, lanl, aug-92

A: GRAIL-SET (locus HUMTPA removed)
A.1. GRAIL                                       
                        DNA   CDS  pred   TP      s1    s2    cc
HUMALPHA               4556  1599  1010  1000    0.63  0.99  0.71
HUMAPRT                3016   543   566   366    0.67  0.65  0.58
HUMFOS                 6210  1143   538   412    0.36  0.77  0.46
HUMMETIA               2941   185   133    30    0.16  0.23  0.15
HUMNMYCA               8762  1630  1307  1290    0.79  0.99  0.86
HUMP45C17              8549  1526   898   869    0.57  0.97  0.71
HUMPAIA               17509  1208  1151   840    0.70  0.73  0.69
HUMPLPSPC              3409   593   323   287    0.48  0.89  0.61
HUMPNMTA               4174   848   804   751    0.89  0.93  0.89
HUMPOMC                8658   803   750   704    0.88  0.94  0.90
HUMPRCA               11725  1386  1065   918    0.66  0.86  0.73
HUMPRPH1               4946   501   363   261    0.52  0.72  0.58
HUMRASH                6453   570   605   503    0.88  0.83  0.84
HUMSAA                 3460   369    82    82    0.22  1.00  0.45
HUMTBB5                8874  1335  1415  1225    0.92  0.87  0.87
HUMTCRAC               5089   426   253   186    0.44  0.74  0.54
HUMTHB                20801  1869  1152  1067    0.57  0.93  0.71
HUMTKRA               13500   705   577   436    0.62  0.76  0.67

average         

