Making a "true" EST consensus

Peter Rice pmr at sanger.ac.uk
Thu Dec 19 00:05:52 EST 1996


In article <58uvpv$9bp at net.bio.net> "Dr. Rob Miller" <rmiller at house.med.und.ac.za> writes:
>    How do we link 3' EST to 5' EST fragments from the same clone in order 
>   to make the linked consensus useful for subsequent searching, alignment
>   and/or translation?

>    We're developing a set of EST consensus sequences to submit to a public
>   database, and naturally we'd like these to be of the greatest utility
>   possible.  We are thinking about the most useful format for the
>   submission.

Hmmm, which public database would this be? EMBL/GenBank and dbEST
already have EST sequences, and presumably would not want them
duplicated with gaps.

There are also other databases of EST consensuses already, most notably
UNIGENE from NCBI.

>    What is the  best way to link data for ESTs which come from the same 
>   clone

You could keep the clone name around (as the entry name?) so hits are
"obvious" and keep the ESTs separate unless clearly joined. That saves
adding gaps.

But this is a tricky issue at present - alternative solutions would be
welcome.

>    Specifically, we'll be creating artificial consensus sequences from two
>   EST consensuses, e.g. a 5' EST AAAAAAAAAAAAAA and a 3' EST ZZZZZZZZZ.
>
>    So our questions are:
>
>      * how many characters would be ideal ? 

Many ESTs in dbEST have an estimate of the insert size - so you could
pad the gap between the 5' and 3' reads with enough characters to bring
the joined sequence up to that size.
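
As a rough sketch of that padding in Python: it assumes the 3' read has
already been reverse-complemented onto the same strand as the 5' read,
and that the insert size estimate is in bases. The function name and the
default pad character "N" are illustrative choices, not anything dbEST
prescribes.

```python
def join_est_pair(five_prime, three_prime, insert_size, pad_char="N"):
    """Join a 5' and a 3' read from one clone, padded to the insert size.

    Hypothetical helper: both reads are assumed to be on the same strand,
    and insert_size is assumed to be an estimate in bases.
    """
    known = len(five_prime) + len(three_prime)
    if known >= insert_size:
        # The reads may overlap; they should be assembled, not padded.
        raise ValueError("reads are at least as long as the insert estimate")
    pad_len = insert_size - known
    return five_prime + pad_char * pad_len + three_prime


if __name__ == "__main__":
    # Toy sequences only; real reads would be hundreds of bases long.
    print(join_est_pair("ACGT", "TTGA", 12))
```

The error case is deliberate: when the two reads together cover the whole
estimated insert, padding would claim a gap that may not exist.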

>      * what else could be used ?

You *must* keep the EMBL/GenBank accession number around so that the same
ESTs can be picked up in other databases (EMBL/GenBank, dbEST, UNIGENE,
RHdb (radiation hybrid EST maps) and so on).
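
One way to carry both the clone name and the accession numbers along is
in the header line of a FASTA-style entry. The sketch below is only one
possible layout: the header format, the function name and the
placeholder identifiers are all made up for illustration.

```python
def fasta_record(clone_name, accessions, sequence, width=60):
    """Format one entry with the clone name as ID and accessions after it.

    Hypothetical layout: ">clone_name ACC1 ACC2", then wrapped sequence.
    """
    header = ">" + clone_name + " " + " ".join(accessions)
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return header + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "clone123", "ACC0001" and "ACC0002" are placeholders, not real entries.
    print(fasta_record("clone123", ["ACC0001", "ACC0002"], "ACGTNNNNTTGA"),
          end="")
```

With the clone name first, a search hit shows the clone immediately, and
the accession numbers after it let the reads be traced in other databases.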

>     * What are the ramifications of 
>
>	using NNN's (unassigned) :
>	or using ----'s (gap) :

Most sequence analysis packages will accept "N", but many will object to
a gap character - there is no clear consensus on what character to use
for gaps.
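
If the gapped form is kept internally, a small conversion step can turn
gap characters into "N" before the sequence goes to a package that
rejects them. This is a sketch only; the set of gap characters ("-", "."
and "~") is an assumption about what the data might contain.

```python
def gaps_to_n(sequence, gap_chars="-.~", fill="N"):
    """Replace every assumed gap character in sequence with fill."""
    table = str.maketrans(gap_chars, fill * len(gap_chars))
    return sequence.translate(table)


if __name__ == "__main__":
    print(gaps_to_n("ACGT--..TTGA"))
```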

--
------------------------------------------------------------------------
Peter Rice                           | Informatics Division,
E-mail: pmr at sanger.ac.uk             | The Sanger Centre,
Tel: (44) 1223 494967                | Wellcome Trust Genome Campus,
Fax: (44) 1223 494919                | Hinxton, Cambridge, CB10 1SA,
URL: http://www.sanger.ac.uk/~pmr/   | England



