A couple of weeks ago, I posted a 'Wish-List' for gene identification
programs. This wish-list contained a number of statistics I would like
to see for coding sequence identification programs like GeneID and GRAIL.
The purpose of having data of this type is two-fold: it would help those
involved in the development of such programs (such as myself) to gauge
the performance of their programs, and it would assist researchers seeking
to find the program best suited to their particular problem.
I have received a number of very informative responses to the problem
I posed. Beyond the need for a standardized battery of performance
statistics, choosing an appropriate set of test sequences is equally
important. All programs I have seen have been parameterized to perform
best on a specific type of sequence. GRAIL, for example, is tailored to
identifying exons in human genes and performs less well on non-human
sequences. In contrast, GeneID is geared to parsing genes from
vertebrates in general, at the expense of reduced performance on human
sequences. While these differences in intent are clearly documented,
they may not be clear to the casual user. Furthermore, one would like
to have a quantitative measure of performance on a number of different
input data types, regardless of the original objective of the program.
Steen Knudsen and Roderic Guigo have supplied me with the data from
their comparisons of GeneID and GRAIL. These test sets are independent
of those used to train the programs and differ enough to get a feel
for how these programs perform on different types of data. The results
obtained on these datasets also fulfill most of the requirements of
my original 'wish-list'. As suggested by Steen Knudsen, I hope these
benchmark datasets and statistics can serve as a model for others
working on this problem and also provide 'consumers' with useful
information about the utility of these programs in their research.
Finally, I would encourage others working in this field to use these
datasets to test their own programs and make suggestions for better
test sets. If researchers will send me their data, I will maintain a
list of the results and standard test sets and periodically post
this information to the net.
Thanks again to Steen Knudsen and Roderic Guigo for sharing their data
and for providing a starting point from which we can begin to structure
the research in this field.
From: rgs at temin.Lanl.GOV (Roderic Guigo)
DATA.
The sets of genes described in PNAS (GRAIL) and JMB (GeneId) as test sets
were considered.
From JMB, the independent data set of 28 genes was used (GI-set),
instead of the larger one of 169 genes, since the larger
set was partially used during the development of GeneId.
From PNAS, the set of 19 genes was used (GR-set).
One gene (HUMTPA) was removed from the GR-set because it exceeds the maximum
allowed length in the current version of GeneID,
and one gene (HAMRPS14A) was removed from the GI-set
because I was unable to retrieve the sequence from GenBank, rel. 72.
In summary, two sets of genes were considered:
GR-set: 18 human genes
GI-set: 27 vertebrate genes
for which the **entire** sequence in the GenBank rel. 72 entry was obtained.
METHOD
1. Both sets were submitted to both GRAIL and GeneId e-mail servers.
[
objection 1: GRAIL was specifically developed to analyze human genes,
while GeneId was intended more generally for vertebrate genes. The
GI-set contains some non-human genes.
objection 2: GeneId was primarily designed to analyze transcription units,
that is, DNA sequences corresponding exactly to the transcribed mRNA.
The entire sequence in a GenBank entry is not usually restricted to
the transcription unit.
]
2. Output from GRAIL and GeneId was parsed to obtain the GRAIL and GeneId
predictions for each gene.
The top model from the GeneId predictions was considered the GeneId prediction
(GI-prediction).
The set of all 'forward' exons (excellent+good+marginal) in the GRAIL
predictions was considered the GRAIL prediction (GR-prediction).
In a sense, GI and GR predictions are not exactly comparable. GI-predictions
are assembled genes (linear combinations of exons satisfying a number of
rules), while GR-predictions are linear arrangements of non-overlapping
exons, which cannot necessarily, and usually cannot, be assembled into
continuous genes. However, in another sense they are comparable, since both
assign to each position on the input sequence a binary value:
coding/non-coding. And this is the way in which they are evaluated in their
respective papers.
[
objection 1: Only excellent (or excellent+good) exons in the GRAIL
predictions could be considered to obtain the GR-predictions.
objection 2: Predictions in the complementary strand could be taken into
account. Since the GR-set and the GI-set
do not contain genes in the complementary strand, this would mean 1)
considering all GRAIL 'reverse' exons as false positives, and
2) running GeneId on the complementary strand and considering the whole
putative GeneId prediction a false positive.
]
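As an illustration of the per-position view described in step 2, the
following Python sketch turns exon coordinate lists into sets of coding
positions and counts true_CDS, predicted_CDS and their overlap (tp). It is
not the code used for this comparison: the function names, the 1-based
inclusive coordinate convention and the example coordinates are all
assumptions made for illustration.

```python
def coding_positions(exons):
    """Return the set of sequence positions covered by a list of exons.

    Each exon is a (start, end) pair, assumed 1-based and inclusive.
    Overlapping exons are counted once per position.
    """
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def overlap_counts(true_exons, predicted_exons):
    """Return (true_CDS, predicted_CDS, tp) as numbers of positions."""
    true_pos = coding_positions(true_exons)
    pred_pos = coding_positions(predicted_exons)
    return len(true_pos), len(pred_pos), len(true_pos & pred_pos)


# Hypothetical coordinates, not taken from any GenBank entry.
true_exons = [(10, 20), (40, 50)]
predicted_exons = [(15, 25), (40, 45)]
print(overlap_counts(true_exons, predicted_exons))  # (22, 17, 12)
```

Under this view it does not matter whether the predicted exons form an
assembled gene (GeneId) or a loose arrangement (GRAIL); only the set of
positions labelled as coding is compared.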
3. For each gene and prediction, I computed DNA_length, true_CDS,
predicted_CDS and true_predicted_CDS (true positive, tp).
From this data I computed the proportion of true_CDS that was correctly
predicted (s1), the proportion of predicted_CDS that was actually true (s2),
and the corresponding correlation coefficient (cc).
In addition, I computed the **average** s1, s2 and cc for each separate data
set, GR-set and GI-set, and for the combined data set GR+GI-set, as in the
JMB paper.
I also computed s1, s2 and cc for the **total** number of positions analyzed,
again for each separate data set, GI and GR, and for the combined set
GI+GR, as in the PNAS paper.
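The quantities in step 3 can be sketched in Python as follows. This is a
minimal sketch, not the code actually used: the function names are
hypothetical, and the formula for cc is an assumption (the usual
correlation coefficient of a 2x2 coding/non-coding table), although it does
reproduce the s1, s2 and cc values listed for HUMALPHA in the GRAIL results
on the GR-set. The three example rows are taken from that same table.

```python
from math import sqrt


def metrics(dna, cds, pred, tp):
    """Return (s1, s2, cc) from position counts for one sequence."""
    fp = pred - tp            # predicted coding, actually non-coding
    fn = cds - tp             # true coding, missed by the prediction
    tn = dna - cds - fp       # non-coding and predicted non-coding
    s1 = tp / cds if cds else 0.0
    s2 = tp / pred if pred else 0.0
    # Assumed definition of cc: correlation of the 2x2 contingency table.
    denom = sqrt(cds * pred * (dna - cds) * (dna - pred))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return s1, s2, cc


def average_metrics(rows):
    """Per-gene s1, s2, cc averaged over genes (the 'average' figures)."""
    per_gene = [metrics(*row) for row in rows]
    n = len(per_gene)
    return tuple(sum(values) / n for values in zip(*per_gene))


def total_metrics(rows):
    """s1, s2, cc on counts pooled over all genes (the 'total' figures)."""
    dna, cds, pred, tp = (sum(column) for column in zip(*rows))
    return metrics(dna, cds, pred, tp)


# (DNA, CDS, pred, TP) for HUMALPHA, HUMAPRT and HUMFOS, GRAIL on the GR-set.
rows = [(4556, 1599, 1010, 1000), (3016, 543, 566, 366), (6210, 1143, 538, 412)]
print([round(x, 2) for x in metrics(*rows[0])])  # [0.63, 0.99, 0.71]
print(average_metrics(rows))
print(total_metrics(rows))
```

The two summaries weight the data differently: the average gives every gene
the same weight, while the total lets long sequences dominate, so the two
can differ noticeably on the same data set.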
RESULTS
(Note that for both GeneId and GRAIL the results obtained on their own data
sets (GI-set and GR-set) differ from those published in their respective papers.
For GeneId, the results are slightly worse, because 1) a different version of
GeneId is currently being executed and 2) the entire GenBank entry,
and not only the transcription unit, is now considered.
For GRAIL, the results are slightly better, because 1) a different version of
GRAIL is currently being executed and ?.)
EVALUATION OF GENEID AND GRAIL (entire GenBank entry) R.guigo, lanl, aug-92
A: GRAIL-SET (locus HUMTPA removed)
A.1. GRAIL
locus         DNA    CDS   pred     TP    s1    s2    cc
HUMALPHA     4556   1599   1010   1000  0.63  0.99  0.71
HUMAPRT      3016    543    566    366  0.67  0.65  0.58
HUMFOS       6210   1143    538    412  0.36  0.77  0.46
HUMMETIA     2941    185    133     30  0.16  0.23  0.15
HUMNMYCA     8762   1630   1307   1290  0.79  0.99  0.86
HUMP45C17    8549   1526    898    869  0.57  0.97  0.71
HUMPAIA     17509   1208   1151    840  0.70  0.73  0.69
HUMPLPSPC    3409    593    323    287  0.48  0.89  0.61
HUMPNMTA     4174    848    804    751  0.89  0.93  0.89
HUMPOMC      8658    803    750    704  0.88  0.94  0.90
HUMPRCA     11725   1386   1065    918  0.66  0.86  0.73
HUMPRPH1     4946    501    363    261  0.52  0.72  0.58
HUMRASH      6453    570    605    503  0.88  0.83  0.84
HUMSAA       3460    369     82     82  0.22  1.00  0.45
HUMTBB5      8874   1335   1415   1225  0.92  0.87  0.87
HUMTCRAC     5089    426    253    186  0.44  0.74  0.54
HUMTHB      20801   1869   1152   1067  0.57  0.93  0.71
HUMTKRA     13500    705    577    436  0.62  0.76  0.67
average