GenBank Errors

Tom Schneider toms at fcs260c2.ncifcrf.gov
Sat Oct 26 17:51:15 EST 1991


In article <1991Oct20.172512.12756 at ccu.umanitoba.ca> frist at ccu.umanitoba.ca writes:

>Ironically, GenBank/EMBL/DDBJ did briefly introduce named objects, using
>feature qualifiers such as "/label=<token>" which were very useful handles on
>individual features. However, these were later removed (at least in
>GenBank) and replaced with "/note=<text>" where <token> is some symbol
>unique to that feature within the entry, and <text> is free form text. The
>first is easily machine-parsable, the second is not.

OH MY GOSH DID THIS REALLY HAPPEN?  What do the folks at GenBank and NCBI have
to say about this?  Did you REALLY give up and DELETE these critical data?  How
could you do this?  Don't you understand how you are killing off certain kinds
of research?  If 90% of the people today reach into the database simply to
compare their sequence to what is there for matches, and you act accordingly,
then you will PREVENT anyone else from doing more interesting things with the
data!

>I have been told that the use of labels in entries has been discontinued
>because: 
>(a) It is too demanding to expect software to be able to deal with them

WHAT?  DISCONTINUED!!  Was the person who said this from the 1920s?  ANY good
programmer worth their salt could do it, and if they couldn't then they
shouldn't be working with computers except as a user.  Besides, once one person
has done it (e.g., Brian!  Bravo!) then, if it is freeware, nobody else needs to
do it!  (C compilers are pretty widespread now, and I KNOW that Brian is
careful about portability.)  That argument bites the dust.
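
To give a sense of how little software is actually needed, here is a minimal
sketch in Python (not Brian's C code, which is not shown in this thread).  It
assumes a simplified feature-table layout: one feature key and location per
line, followed by qualifier lines that begin with "/".  The sample entry, the
label "magA_promoter", and the function names are all invented for
illustration, and multi-line qualifier values are not handled.

```python
import re

# Hypothetical feature-table fragment in a simplified flat-file style.
# The contents are made up for illustration only.
SAMPLE = """\
     CDS             11..238
                     /label=magA_protein
                     /note="free-form text that a program cannot rely on"
     promoter        1..10
                     /label=magA_promoter
"""

# A qualifier line looks like /name=value once leading blanks are removed.
QUALIFIER = re.compile(r'^/(\w+)=(.*)$')


def parse_features(text):
    """Return a list of dicts holding 'key', 'location' and any qualifiers."""
    features = []
    for line in text.splitlines():
        body = line.strip()
        if not body:
            continue
        match = QUALIFIER.match(body)
        if match and features:
            # Attach the qualifier to the most recently seen feature.
            name, value = match.groups()
            features[-1][name] = value.strip('"')
        else:
            # Otherwise the line starts a new feature: key, then location.
            key, location = body.split(None, 1)
            features.append({"key": key, "location": location})
    return features


def index_by_label(features):
    """Map each /label token to the feature that carries it."""
    return {f["label"]: f for f in features if "label" in f}


labels = index_by_label(parse_features(SAMPLE))
print(labels["magA_protein"]["location"])
```

The /label token works directly as a dictionary key, while the /note text can
only be stored as an opaque string.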

I can't believe this.  Really?  They said this?  Oh my.

>(b) As the naming conventions change, feature names would have to change

Of course!  NOTHING is completely stable.  But if the names are genetic ones,
such changes are minimized, and synonyms can be supported.  That is a LOT
better than the LOCUS/ACCESSION names we have now!

BESIDES, LOCUS and ACCESSION names are built on an incredibly unstable model:
what humans have sequenced.  Wouldn't a better model for the database be what
is out there in nature?  Then the model becomes cleaner and cleaner with time,
everything merging into single huge sequences (as in nature).  What we have now
gets bigger with more messy overlaps and more horrible names.

>(c) It is better to refer to features from another entry by absolute
>coordinates (eg. X30405:11..238) rather than by label (eg.
>X30405:magA_protein)

There is a grain of truth here.  IF historical recording is insisted on (and it
looks like this is going to be shoved down our throats, without serious
discussion as to whether it is the best method or not; note that the discussion
has died the death of silence) then obscure names will be with us forever, and
there will be times when one will be FORCED to use these ugly names, just as we
are forced to use LOCUS names now.  Wouldn't it be nicer to say "organism
E. coli; chromosome pBR322; strain k736; get from gene amp start to gene amp
start + 50"?  "X30405:11..238" is pretty obscure!
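
The two reference styles in point (c) do not have to exclude each other.  As a
rough sketch in Python, a resolver can accept either "ENTRY:start..end" or
"ENTRY:label" and return the same coordinates.  The label table below is
hypothetical (it reuses the X30405 example from the quoted post), and the name
`resolve` is invented for illustration; only simple "start..end" locations are
assumed.

```python
import re

# Hypothetical label table: entry -> label -> (start, end), 1-based inclusive.
LABELS = {"X30405": {"magA_protein": (11, 238)}}

# Matches a plain coordinate range such as 11..238.
COORDS = re.compile(r'^(\d+)\.\.(\d+)$')


def resolve(reference, labels=LABELS):
    """Turn 'ENTRY:start..end' or 'ENTRY:label' into (entry, start, end)."""
    entry, _, target = reference.partition(":")
    match = COORDS.match(target)
    if match:
        # Absolute coordinates: use them as given.
        return entry, int(match.group(1)), int(match.group(2))
    # Otherwise treat the target as a label; an unknown label raises KeyError.
    start, end = labels[entry][target]
    return entry, start, end


print(resolve("X30405:11..238"))
print(resolve("X30405:magA_protein"))
```

Both calls return the same triple, so supporting labels takes nothing away
from anyone who prefers absolute coordinates.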

The original FORTRAN forced one to use GOTOs, which are well known to be
hazardous to a program.  The same is true about the reliance on LOCUS and
ACCESSION names.  They clearly have bad features.  Aside from the fact that
they exist, I have never heard a cogent argument for keeping them.  Anybody up
for defending these atrocities?  :-) AT LEAST those of us who wish to look at
the database as a beautiful thing (someday) should be able to access it by
genetic and standard names.

>Having spent several years now writing programs to work with features, I am
>absolutely convinced of the necessity for named features. The more I think
>about the [lettered] arguments above, the less they convince me.

Bravo!  I agree with the posting.  I, for one, had not heard your opinions on
names before.  The problem, Brian, is that the folks who are building the
database are not often in the position of trying to extract features.  So they
unintentionally (??) make it tough to do so.

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms at ncifcrf.gov


