Software to read whole genomes?
eddy at wol.wustl.edu
Mon Sep 29 17:48:24 EST 1997
In article <60olra$2cj at gap.cco.caltech.edu> mathog at seqaxp.bio.caltech.edu writes:
>Then there is the problem that there is no automatic way to figure out
>where to chop off the ends of the EST, or indeed, to even figure out the
>orientation of the sequence. Can anybody explain to me why sequences in
>the EST databases are not all converted to the 5'->3' orientation, even when
>sequenced in the other direction? The direction of sequencing is AN
>EXPERIMENTAL ARTIFACT, the direction of every other mRNA is always given in
>the proper orientation, right?!?!?
AFAIK, in the GenBank files of all WashU EST data, including
Merck/WashU human ESTs and NCI/WashU ESTs:
- sequencing vector and low-quality start are clipped off before submission
- the high-quality stop point is annotated in a parsable form
  in the COMMENT field:
      High quality sequence stop: 155
- the orientation of the read is annotated redundantly in parsable form
  in the DEFINITION field:
      zv37h04.s1 Soares ovary tumor NbHOT Homo sapiens clone 755863 3'
      <clone plate location>.[s,r]1 <library> <clone ID> [5,3]'
  where an s1 is a 3' read; r1 is a 5' read.
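
A minimal sketch of pulling those two annotations out of a record, assuming
the DEFINITION and COMMENT text has already been extracted as plain strings
and follows the layout quoted above. The function name parse_est_annotation
and the returned keys are made up for illustration, and real records may
deviate from this layout. The read end is taken from the explicit 5'/3'
token, not inferred from the s1/r1 suffix.

```python
import re

# Patterns follow the layout quoted in this post; they are assumptions,
# not a specification of the GenBank EST format.
STOP_RE = re.compile(r"High quality sequence stop:\s*(\d+)")
NAME_RE = re.compile(r"^(\S+)\.([sr]1)\b")
END_RE = re.compile(r"\s([53])'")


def parse_est_annotation(definition, comment):
    """Return read name, suffix, annotated read end, and high-quality stop.

    Any field that cannot be found is returned as None instead of guessed.
    """
    name = NAME_RE.match(definition.strip())
    ends = END_RE.findall(definition)
    stop = STOP_RE.search(comment)
    return {
        "read_name": name.group(1) if name else None,
        "suffix": name.group(2) if name else None,
        # Use the last 5'/3' token, which closes the DEFINITION line above.
        "read_end": ends[-1] + "'" if ends else None,
        "hq_stop": int(stop.group(1)) if stop else None,
    }


if __name__ == "__main__":
    definition = ("zv37h04.s1 Soares ovary tumor NbHOT Homo sapiens "
                  "clone 755863 3'")
    comment = "High quality sequence stop: 155"
    print(parse_est_annotation(definition, comment))
```

Returning None for a missing field keeps the parser from adding inferences
of its own on top of what the record actually says.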
I'm not directly involved in the ESTs, so take this as probably right
but not definitive. LaDeana Hillier and Marco Marra at the WashU GSC
could answer definitively if you email them.
Since all this information is in the file, the orientation of the
sequence is arbitrary. Since ESTs are artifact prone and highly
automated, the idea is to provide the off-the-machines orientation and
stick to the observed data, without piling on more automatic
inferences than necessary. A 3' read is a 3' read, usually but not
necessarily a 3'->5' sequence of an mRNA. You shouldn't blind-faith
reverse complement them. You aren't guaranteed that the clone is from
an mRNA; you aren't guaranteed that the directional cloning worked;
and though the EST sequencing group is very conscientious and very
good at what they do, you aren't guaranteed that there wasn't an
artifact in the labeling or handling of the plates or sequencing.
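
If you do want a candidate mRNA-sense sequence for your own analysis, one
way to keep the inference visible is sketched below. The function names are
hypothetical, and treating the high-quality stop as a 1-based inclusive
position is my assumption, not something stated in the records. The point is
only that the reverse complement is returned with a flag marking it as
inferred, so the as-read sequence is never silently replaced.

```python
# Complement table for unambiguous bases plus N; other IUPAC codes are
# left unchanged by translate(), which is a simplification.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def high_quality_part(seq, hq_stop):
    """Keep bases up to the annotated stop (assumed 1-based, inclusive)."""
    if hq_stop is None:
        return seq
    return seq[:hq_stop]


def candidate_mrna_sense(seq, read_end):
    """Return (sequence, inferred).

    inferred is True when the sequence was reverse complemented because the
    read is annotated as 3'. That is a guess about the clone, not an
    observation, so callers should carry the flag along.
    """
    if read_end == "3'":
        return reverse_complement(seq), True
    return seq, False


if __name__ == "__main__":
    read = high_quality_part("AACGTTTT", 4)
    print(candidate_mrna_sense(read, "3'"))
```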
- Sean Eddy, Ph.D.
- Dept. of Genetics, Washington University School of Medicine
- 660 S. Euclid Box 8232, St. Louis MO 63110, USA
- mailto://firstname.lastname@example.org http://genome.wustl.edu/eddy