GenBank Errors

frist at ccu.umanitoba.ca
Sun Oct 20 12:25:12 EST 1991


In article <2453 at fcs280s.ncifcrf.gov> toms at fcs260c2.ncifcrf.gov (Tom Schneider) writes:
>I've been suggesting solutions, such as named objects and merged entries, for
>10 years.  Is that long enough? 
Ironically, GenBank/EMBL/DDBJ did briefly introduce named objects, using
feature qualifiers such as "/label=<token>", which were very useful handles
on individual features. However, these were later removed (at least in
GenBank) and replaced with "/note=<text>". Here <token> is some symbol
unique to that feature within the entry, and <text> is free-form text. The
first is easily machine-parsable; the second is not.
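
To illustrate the difference, here is a minimal Python sketch that pulls
labels out of feature-table lines. The sample lines, the name labels_in()
and the set of characters allowed in a token are all assumptions made for
the example, not taken from any database release. A /label token can be
matched with one short pattern, whereas nothing comparable can be written
for the free text of a /note.

```python
import re

# Assumed token alphabet for illustration: letters, digits, _ ' * and -
LABEL = re.compile(r"^\s*/label=([A-Za-z0-9_'*-]+)\s*$")

def labels_in(lines):
    """Return every /label token found in a list of feature-table lines."""
    found = []
    for line in lines:
        match = LABEL.match(line)
        if match:
            found.append(match.group(1))
    return found

# Hypothetical feature, written in the general style of a FEATURES table.
sample = [
    '     CDS             11..238',
    '                     /label=magA_protein',
    '                     /note="putative magA protein; see text"',
]

print(labels_in(sample))
```

The /note line is skipped because there is no reliable way to tell which
words in it, if any, were meant to name the feature.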

I have been told that the use of labels in entries has been discontinued
because: 
(a) It is too demanding to expect software to be able to deal with them
(b) As the naming conventions change, feature names would have to change
(c) It is better to refer to features from another entry by absolute
coordinates (e.g. X30405:11..238) rather than by label (e.g.
X30405:magA_protein)

Having spent several years now writing programs to work with features, I am
absolutely convinced of the necessity for named features. The more I think
about the arguments above, the less they convince me.


(a) It is too demanding to expect software to be able to deal with them
Balderdash! (said in a light-hearted vein, okay? Here's a smiley to prove
it 8-)) I have written programs to parse any FEATURES expression, and yes,
it was hard, but it can be done. The FEATURES language is, in its formal
definition, very powerful, but it must be used in the database if we are to
take advantage of its true power. The assumption in this statement is that
some subset of the language is easy enough to handle, such as a simple base
range. But if you work with a base range, you had better be able to handle
such constructs as <11..238, one-of(10,11,12)..238, (10.12)..238, etc.
Where do you draw the line? It is exactly for this reason that I wrote
GETOB, which is a general purpose program that can parse ANY legal FEATURES
expression and generate a file containing the resultant sequences. This
way, one program handles all of the parsing problems, and the other
programs don't have to know anything about the FEATURES language.
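
As a rough illustration of where the line gets drawn (this is not GETOB,
and it covers only the three forms mentioned above), a small Python sketch
might look like the following. The names parse_position() and parse_range()
and the returned dictionary layout are hypothetical. Operators such as
join() or complement(), and references to other entries, are deliberately
left out and raise an error.

```python
import re

# One endpoint of a range. Only three forms are recognised in this sketch.
_POS = re.compile(r"""
    (?P<partial>[<>])?(?P<exact>\d+)$        # exact base, optionally partial
  | \((?P<lo>\d+)\.(?P<hi>\d+)\)$            # one base somewhere in lo..hi
  | one-of\((?P<choices>\d+(?:,\d+)*)\)$     # one of the listed bases
""", re.VERBOSE)

def parse_position(text):
    """Parse a single endpoint into a small dictionary."""
    m = _POS.match(text.strip())
    if m is None:
        raise ValueError(f"unsupported position: {text!r}")
    if m.group("exact"):
        n = int(m.group("exact"))
        return {"kind": "exact", "low": n, "high": n,
                "partial": m.group("partial") is not None}
    if m.group("lo"):
        return {"kind": "between", "low": int(m.group("lo")),
                "high": int(m.group("hi")), "partial": False}
    choices = [int(c) for c in m.group("choices").split(",")]
    return {"kind": "one-of", "low": min(choices), "high": max(choices),
            "partial": False}

def parse_range(text):
    """Parse 'start..end' (or a lone position) into (start, end)."""
    parts = text.strip().split("..")
    if len(parts) == 1:
        parts = parts * 2          # a single base is its own start and end
    if len(parts) != 2:
        raise ValueError(f"unsupported range: {text!r}")
    return parse_position(parts[0]), parse_position(parts[1])

for expr in ["<11..238", "one-of(10,11,12)..238", "(10.12)..238"]:
    start, end = parse_range(expr)
    print(expr, "->", start["kind"], start["low"], start["high"], end["low"])
```

Even this fragment needs three cases for one endpoint, which is the
argument for keeping the parsing in a single program.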

(b) As the naming conventions change, feature names would have to change
Well, that isn't perceived to be a problem with ACCESSION numbers. While
mnemonically useful feature names may occasionally be made obsolete through
a change in nomenclature in a given field, this is no more of a problem
than any other aspect of nomenclature. For example, photobiologists refer
to the chlorophyll a/b-binding protein of thylakoid membranes as cab, LHCP
(light-harvesting chlorophyll protein) or LHCP. In most cases, these dual
assignments are never resolved, and people learn to live with them. It
seems perfectly reasonable to me to use, in an entry, a label that
indicates what the authors of the publication reporting the sequence called
it, which means that you can always go back to the reference to find out
precisely what they were talking about (if they themselves knew!).

(c) It is better to refer to features from another entry by absolute
coordinates (e.g. X30405:11..238) rather than by label (e.g.
X30405:magA_protein)
Again, I must disagree. If entries are merged, or errors such as insertions
or deletions are introduced, or if new experimental data indicate a change
in the coordinates of the feature, then the coordinates must change. In all
of these cases, the feature label would still refer to the same feature.

One way in which labeled features are important is in the maintenance of
virtual databases. For example, suppose you wished to maintain a database
of the 5'-most exons in a set of genes. You can't simply pull out every
'exon', but you could specify them with a file that looked like this:

M22396:exonI
X00198:exon-I
K76285:exon_I

which would retrieve the same features from release to release, even if
the absolute coordinates of these features had changed, for example, due to
the addition of 5' sequence that hadn't been in the original entry. Another
thing worth noting is that such a mechanism even gets around
inconsistencies in naming of the feature from entry to entry. As long as
the feature name is kept, you can always refer to it, regardless of how the
nomenclature in the field may change. 
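
A minimal Python sketch of this idea follows. The function resolve(), the
dictionary layout and all of the coordinates are made up for illustration;
they do not come from any real release. Each release is modelled as a
mapping from accession to a mapping from label to location, and the same
reference is resolved against two releases.

```python
def resolve(reference, release):
    """Look up 'ACCESSION:label' in a release and return its location."""
    accession, sep, label = reference.partition(":")
    if not sep:
        raise ValueError(f"expected ACCESSION:label, got {reference!r}")
    return release[accession][label]

# Hypothetical releases. In the second one, 500 bases of 5' sequence have
# been added to M22396, so the exon's coordinates shift by 500.
release_old = {"M22396": {"exonI": "11..238"}}
release_new = {"M22396": {"exonI": "511..738"}}

virtual_db = ["M22396:exonI"]

for ref in virtual_db:
    print(ref, resolve(ref, release_old), resolve(ref, release_new))
```

The file of references never has to be edited; a file written as
M22396:11..238 would silently point at the wrong bases in the newer release.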

I realize that I have made my opinion on this subject known in the past,
but I consider it such a fundamentally important issue with respect to the
utility of the database that I felt it necessary to expand on these ideas
here.

===============================================================================
Brian Fristensky                | Conservation is getting nowhere because it is
Department of Plant Science     | incompatible with our Abrahamic concept of
University of Manitoba          | land. We abuse land because we regard it as a
Winnipeg, MB R3T 2N2  CANADA    | commodity belonging to us. When we see land 
frist at ccu.umanitoba.ca          | as a community to which we belong, we may  
Office phone:   204-474-6085    | begin to use it with love and respect.
FAX:            204-275-5128    | Aldo Leopold, 1948 A SAND COUNTY ALMANAC
===============================================================================