Software to read whole genomes?

mathog at seqaxp.bio.caltech.edu mathog at seqaxp.bio.caltech.edu
Mon Sep 29 11:43:22 EST 1997


In article <Pine.A41.3.95.970926125911.54182D-100000 at aix1.uottawa.ca>, "colossus..." <s535290 at aix1.uottawa.ca> writes:
>
>The major problem is that since these sequences contain extraneous bases,
>trying to come up with a "correct" consensus sequence (to then translate
>in order to derive phylogenetic trees based on protein sequences) would be
>relatively nightmarish. Any thoughts ?

You are correct - it is a nightmare.  

You must derive the consensus sequence from each pile of ESTs, treating
them like a sequencing project, before you go on to do a phylogenetic
analysis between them.  If you just plow ahead and try to align a bunch of
ESTs, using the standard tools (such as Pileup, in GCG), the alignment will
be awful.  There is a good reason for this.  Most of these tools assume
that differences between sequences have an evolutionary basis: if two
sequences share a common variation, they are scored as "closer" than if
they do not.  Unfortunately, in ESTs much of the variation is sequencing
noise, so shared variations are too often random events, and the more ESTs
you add, the worse the problem gets.

Then there is the problem that there is no automatic way to figure out
where to chop off the ends of the EST, or indeed, to even figure out the
orientation of the sequence.  Can anybody explain to me why sequences in
the EST databases are not all converted to the 5'->3' orientation, even
when sequenced in the other direction?  The direction of sequencing is AN
EXPERIMENTAL ARTIFACT; the sequence of every other mRNA is always given in
the proper orientation, right?!?!?
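
To make the orientation point concrete, here is a minimal Python sketch of
the conversion itself.  It assumes you already know, from the database
annotation or by inspection, which reads were sequenced from the 3' end;
it does not detect orientation for you.  The function names and the
read_from_3prime flag are made up for this illustration.

```python
# Complement table for the four standard bases plus N (unknown base).
# Other IUPAC ambiguity codes are left unchanged by this sketch.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_mrna_orientation(seq, read_from_3prime):
    """Return seq in the 5'->3' sense of the mRNA.

    read_from_3prime is an assumption supplied by the caller: True if the
    read was sequenced from the 3' end and so is on the opposite strand.
    """
    return reverse_complement(seq) if read_from_3prime else seq


if __name__ == "__main__":
    print(to_mrna_orientation("AAGT", read_from_3prime=True))
```

Reads already in the mRNA sense pass through untouched, so the same call
can be applied to every EST in a pile before assembly.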

Anyway, in summary, if you must attack this problem, use a good sequence
assembly program (gelassemble in GCG can handle a fair amount of this EST
noise), edit each pile of ESTs down to a single consensus sequence, and then 
align the consensus sequences.
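
As a rough illustration of the "edit each pile down to a consensus" step,
here is a minimal Python sketch of a column-by-column majority vote.  It is
not what gelassemble does internally; it assumes the reads in a pile have
already been oriented and aligned to the same length, with '-' marking a
gap and '.' marking positions a read does not cover.  Those conventions,
the function name and the min_fraction parameter are all assumptions made
for this example.

```python
from collections import Counter


def column_consensus(aligned_reads, min_fraction=0.5):
    """Majority-vote consensus over a pile of pre-aligned reads.

    A column is called only if its most frequent symbol accounts for more
    than min_fraction of the reads covering it; otherwise 'N' is emitted.
    Columns where a gap wins are dropped, which removes extraneous bases
    that appear in only a minority of reads.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Ignore '.' padding where a read does not cover this column.
        counts = Counter(base.upper() for base in column if base != ".")
        if not counts:
            consensus.append("N")
            continue
        base, votes = counts.most_common(1)[0]
        if votes / sum(counts.values()) <= min_fraction:
            consensus.append("N")   # no clear majority
        elif base != "-":
            consensus.append(base)  # a majority gap is simply skipped
    return "".join(consensus)


if __name__ == "__main__":
    pile = ["ATG-CA", "ATGTCA", "ATG-CA", "ACG-CA"]
    print(column_consensus(pile))
```

In the example pile, the extra T in the second read and the C in the fourth
are each outvoted by the other reads.  A real assembly still needs manual
editing, since a simple vote cannot tell a sequencing error from a genuine
difference between closely related transcripts.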

Good luck,

David Mathog
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech 
**************************************************************************
*Affordable VMS? See:  http://seqaxp.bio.caltech.edu:8000/www/pcvms.html *
**************************************************************************


