Beyond GenBank.

Keith Robison robison1 at
Wed Dec 16 18:19:15 EST 1992

In article <BzByvA.E7M at> toms at (Tom Schneider) writes:
>In article <BRIANF.92Dec14175130 at> brianf at
>(Brian Foley) writes:
>>	Thus there is a wealth of information about DNA sequences that
>>is not getting into GenBank.
>YES!  This is why I have the middle name I do.  This huge loss of data is going
>to catch up with us sooner or later (I'd say within 5 years) and people will
>not be able to track down all the data.  It's just about impossible NOW!  Can
>Sankar Adhya - world renowned expert on CRP - give me a current list of all
>known sites (ie experimentally proven - none of this garbage prediction
>stuff)?  NO!  What about IHF, FIS, splice junctions, etc?  The data is being
>lost into our huge literature at a horrible rate.  (Tell me, are you willing to
>slog through 6000, yes SIX THOUSAND splice junctions for humans to remove all

	Based on my own experience, I would say that there are probably
fewer than 100 errant splice junctions in all of GenBank, though there are
also a large number which cannot be identified unambiguously, even
experimentally (at least by no method I know of).  These occur when the
intron site falls in or adjacent to one or more G's in the exons.  For anyone
who would like a list of all suspect junctions, I will have one right after
Christmas (when GB74 is installed).
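
To illustrate one reading of the ambiguity described above: if a junction is
located only by comparing the spliced sequence against the genomic sequence,
then any intron placement that yields the same spliced product is equally
consistent with the data.  The Python sketch below enumerates those
placements.  The function name and the toy sequences are made up for
illustration; this is not the procedure used to build the list of suspect
junctions.

```python
def equivalent_intron_placements(genomic, start, end):
    """List every same-length intron placement (start, end) that gives the
    same spliced product as removing genomic[start:end] (0-based, end-exclusive)."""
    spliced = genomic[:start] + genomic[end:]
    length = end - start
    placements = []
    for s in range(len(genomic) - length + 1):
        # Remove a window of the same length at s and compare the products.
        if genomic[:s] + genomic[s + length:] == spliced:
            placements.append((s, s + length))
    return placements


# Toy example: the upstream exon ends in G and the intron also ends in G,
# so the boundary can slide by one base without changing the spliced product.
ambiguous = "ATCCG" + "GTAAGTCTTTCAG" + "CTTA"
# Same intron, but the upstream exon ends in A: only one placement fits.
unambiguous = "ATCCA" + "GTAAGTCTTTCAG" + "CTTA"

print(equivalent_intron_placements(ambiguous, 5, 18))
print(equivalent_intron_placements(unambiguous, 5, 18))
```

More than one placement in the result means the sequence comparison alone
cannot settle where the junction is.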

>But NCBI refuses to acknowledge and deal STRONGLY with this growing disaster,
>and until more people recognize the problem (as you did, congratulations!) it
>will only become worse.  ("Strong" means to have a policy in effect which
>will lead to the assured capture of all experimentally supported data
>on sequence features.  Guesses should not be in the database.)

Guesses shouldn't be -- predictions should be but marked as such.
Only then can other people retest those predictions using new schema.
Tom, if we took your suggestion literally, our protein databases would
be nearly empty as most known protein sequences are derived from DNA 
sequences with the ASSUMPTIONS that the DNA sequence is accurate and
no monkey business (translational frameshifting, ribosome hopping, 
RNA editing, etc) is going on.   I am currently involved in a project
which is finding a significant (>30) number of possible exceptions to these
assumptions; of course to do this we are using a new set of assumptions.


>LOCUS name changes at the whim of the moment, and is no longer meaningful
>for any long term storage.  It really should be eliminated.
>ACCESSION number keeps being updated and altered.  At least the old one is
>there, but there are no ABSOLUTE names in sequences to which one can refer.
>So the locations of mutations are FORCED by this (excuse me) stupid system to
>be numbers.  That means that when the sequence numbering changes (eg, merge of
>sequences - which HAS to be done!!! or error correction) the pointers in the
>secondary database will be blown away.  There is no assured mechanism to give
>coordinates RELATIVE to standard marks in the sequence.  (The folks who
>followed last year's row on this subject will see that the ideas never change.
>The database doesn't improve.  Only the date changes and the time to disaster
>There isn't anything else to hang references on.  NCBI is unwilling to face
>these tough issues.  It's that time of year folks.

       For one's own use there IS a way to do this -- record a fragment of
the feature as a Sequence Tagged Site in your database.  I don't know 
if the databases should really use this scheme, but with tools such
as BLAST it does work on a local scale.
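
A minimal Python sketch of the tagged-fragment idea above, assuming exact
string matching as a stand-in for a BLAST search (a real search would
tolerate mismatches).  The helper names make_anchor and relocate, the flank
size, and the toy sequences are all hypothetical.

```python
def make_anchor(sequence, position, flank=10):
    """Record the bases around a feature as a tag, plus the feature's
    offset inside that tag (0-based)."""
    start = max(0, position - flank)
    tag = sequence[start:position + flank]
    return tag, position - start


def relocate(sequence, tag, offset):
    """Find the feature in a (possibly renumbered) sequence via its tag.
    Returns None if the tag is missing or occurs more than once."""
    first = sequence.find(tag)
    if first == -1:
        return None
    if sequence.find(tag, first + 1) != -1:
        return None  # tag is not unique, so the position is ambiguous
    return first + offset


old_entry = "ACGTTGCAAGGCTTACCGATAGGCTAACGT"
tag, offset = make_anchor(old_entry, 15)

# Simulate a merge that prepends five bases and shifts all numbering.
new_entry = "TTTTT" + old_entry
print(relocate(new_entry, tag, offset))
```

The stored tag survives renumbering because it refers to the sequence
itself rather than to a coordinate.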

	The better way to do this on a grand scale is for the
database to consist of a set of changes rather than a set of data.
If you record the changes in such a way that the evolution of the database
can be precisely traced forward OR backward from ANY point in time, then
you can directly update old cross-references to conform with the current

>Dear David Lipman, I'm sorry to say: no.
>Let's discuss this on the net David, where everybody can "listen".

Hear, Hear!

>  Tom "Cassandra" Schneider
>  National Cancer Institute
>  Laboratory of Mathematical Biology
>  Frederick, Maryland  21702-1201
>  toms at

Keith Robison
Harvard University
Department of Cellular & Developmental Biology
Department of Genetics / HHMI

robison at 
