Genome databases: strengths and weaknesses
howe at DARWIN.UCSC.EDU
Tue Nov 26 13:10:19 EST 1996
I appreciate your optimism, but I think you're confusing my realism with
pessimism. As a molecular geneticist with bioinformatics training, I am
well aware of the pitfalls and shortcomings of in silico analysis of
genome databases. I do applaud the efforts and achievements of
sequencing labs; their work has benefited everyone more than they will
ever be credited for. But the simple fact of the matter is that
scientists, and as a consequence their computers, do not know
everything there is to know about how genes are arranged. It follows
that the more we learn about gene structure, the better we will be at
predicting the function of a particular sequence.
The recent publications of two complete genomes illustrate my point:
1. I refer you to the article in Science (C. Bult et al., 1996,
273:1058-1073) in which the authors describe the completion and analysis
of the genome sequence of an archaeon, M. jannaschii. It is clear
that the list of identified protein-coding and structural RNA-encoding
genes is incomplete. In fact, this sequence will probably continue to be
a source of discoveries into the next century.
2. Completion of the S. cerevisiae genome occurred earlier this year,
yet identification of new ORFs continues, as does the characterization of
previously identified ORFs. Since I work on pre-mRNA splicing in this
system, I expect that new ORFs will be identified whose introns have
more degenerate splice site signals than the consensus sequences used in
searching for potential coding sequences. The problem is exacerbated in
higher eukaryotes, where splicing signals are even more degenerate, which
makes it much harder to identify introns and, in turn, the coding
sequences that flank them. Finally, the relatively recent discovery of a
family of structural RNAs called the snoRNAs, many of which are
"genes-within-genes" because they are encoded in the introns of other
genes (see A. G. Balakin et al., Cell 1996, 86:823-834, and references
therein), demonstrates that our knowledge about gene structure is incomplete.
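
To make the point about degenerate splice signals concrete, here is a
minimal sketch in Python. It is only an illustration under stated
assumptions: the donor and branchpoint strings are the commonly cited
yeast consensus sequences written as DNA (they are not given above), the
toy sequence is invented, and simple mismatch counting stands in for
whatever scoring a real gene-finding program would use.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_motif(seq, consensus, max_mismatches=0):
    """Return (position, mismatches) for each window of seq that is
    within max_mismatches of the consensus string."""
    seq = seq.upper()
    k = len(consensus)
    hits = []
    for i in range(len(seq) - k + 1):
        d = hamming(seq[i:i + k], consensus)
        if d <= max_mismatches:
            hits.append((i, d))
    return hits


# Assumption: commonly cited yeast intron signals, written as DNA.
DONOR = "GTATGT"         # 5' splice site consensus
BRANCHPOINT = "TACTAAC"  # branchpoint consensus

# Hypothetical toy sequence. Its 5' splice site is GTACGT, which
# differs from the consensus at one position.
toy = (
    "ATGGCA"      # upstream exon
    + "GTACGT"    # degenerate 5' splice site
    + "TTTTTTTT"
    + "TACTAAC"   # branchpoint, exact match
    + "TTTTTTT"
    + "CAG"       # 3' end of the intron
    + "GCATAA"    # downstream exon
)

strict = find_motif(toy, DONOR, max_mismatches=0)   # consensus-only search
relaxed = find_motif(toy, DONOR, max_mismatches=1)  # tolerate one mismatch
branch = find_motif(toy, BRANCHPOINT)

print("strict donor hits: ", strict)
print("relaxed donor hits:", relaxed)
print("branchpoint hits:  ", branch)
```

The consensus-only search finds no 5' splice site in the toy sequence,
so the intron and the ORF that spans it would be missed; allowing a
single mismatch recovers the site. The cost of relaxing the search in a
real genome is, of course, many more false positives.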
My original observation was meant as a warning to approach completed
sequences with an open mind: that there are deficiencies in our knowledge
about their structure as well as in the integrity of these sequences (as I
explained earlier). I heartily promote the analysis of sequence
databases, and I personally have gained much from doing just that, but it
should also be said that sequences are not self-explanatory. Rather,
they should be viewed only as data points, and as such they are
subject to our interpretation.
On Tue, 26 Nov 1996, Joe Miano wrote:
> Date: Tue, 26 Nov 1996 09:16:44 -0600
> From: Joe Miano <jmiano at post.its.mcw.edu>
> To: Ken Howe <howe at darwin.UCSC.EDU>
> Subject: Re: Functional Genomics!!
> Pessimism cost the Soviets first place in the race to the moon!
> What with all of the mass sequencing going on and informatics in high gear I
> must say that I am a little more optimistic that all ORFs will be
> identified, perhaps even before the 2005 target date. Regardless, this fact
> remains: Genome people must begin thinking about what they will do when
> their job is complete. There are probably several lifetimes of work that
> will be needed to carry gene sequence to gene function.
> Thanks for the note!
"You know, it sure would be amino world without RNA"
Center for the Molecular Biology of RNA
Department of MCD Biology
University of California
Santa Cruz, CA 95064
e-mail: howe at darwin.ucsc.edu