Keeping GenBank/EMBL/DDBJ Software-Parseable
frist at ccu.umanitoba.ca
Sun Sep 15 16:20:07 EST 1991
I am not one to complain, as those familiar with my postings will
attest, and I am particularly hesitant to criticize those who
oversee major databases such as GenBank, which, in my opinion, have
made great progress in recent years with the introduction of the
DDBJ/EMBL/GenBank Feature Table format. However, I perceive a
philosophical shift which to me undermines much of that progress.
BACKGROUND, FOR THOSE NOT FAMILIAR WITH THE FEATURES TABLE FORMAT
When the databases were first released using the new Features
format, this was perhaps the most significant step ever towards
making the data machine-accessible. In place of the
old, difficult-to-parse Features table was, in effect, a fairly
rigorously-defined programming language. For example, an mRNA composed
of three exons could be constructed by evaluating a very simple expression.
The power of this language is that a single parsing
algorithm can create ANY feature! Each expression evaluates to a sequence.
I'll repeat that because it is the crux of the matter: "EACH
EXPRESSION EVALUATES TO A SEQUENCE"
Thus, the more features are formulated as expressions, the
more features can be retrieved by software. In other words,
the utility of the database is increased as more features are
encoded into the language.
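To make "each expression evaluates to a sequence" concrete, here is a minimal Python sketch of a recursive evaluator. It is not GETOB's implementation. It assumes only the span (a..b), single-position, join() and complement() forms of the Features language, and the names evaluate and split_args, like the ten-base test sequence, are hypothetical and chosen for illustration.

```python
import re

# Base-pairing table used by complement(); lowercase and uppercase are both handled.
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def split_args(text):
    """Split a comma-separated argument list, ignoring commas inside parentheses."""
    args, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            args.append(text[start:i])
            start = i + 1
    args.append(text[start:])
    return [a.strip() for a in args]


def evaluate(expr, seq):
    """Evaluate a location expression against seq and return a sequence.

    Coordinates are 1-based and inclusive, as in the Features table.
    """
    expr = expr.strip()
    m = re.fullmatch(r"(\d+)\.\.(\d+)", expr)
    if m:  # a span such as 23..45
        return seq[int(m.group(1)) - 1:int(m.group(2))]
    if re.fullmatch(r"\d+", expr):  # a single position such as 207
        return seq[int(expr) - 1]
    m = re.fullmatch(r"(join|complement)\((.*)\)", expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    op, body = m.groups()
    if op == "join":
        # Every argument is itself an expression, so the same routine recurses.
        return "".join(evaluate(part, seq) for part in split_args(body))
    # complement: evaluate the inner expression, then reverse-complement it.
    return evaluate(body, seq).translate(_COMPLEMENT)[::-1]
```

With the toy sequence "acgtacgtac", evaluate("join(1..2,5..6,9..10)", "acgtacgtac") returns "acacac". The point of the design is that one routine handles every feature, because every sub-expression it meets is again something that evaluates to a sequence.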
RECENT CHANGES IN THE ENCODING OF FEATURES
The GenBank entry BRLTRPLE (M17892) illustrates the way in which
the utility of the database can be decreased by how annotation is
implemented. When the new Features format was first implemented, a
mutant containing two point mutations was defined in the expression
where r1 and r2 referred to two features, identified by the unique
labels r1 and r2, which contained the expressions
replace(207,"a") and replace(443,"c")
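As a rough illustration of why labeled replace() expressions are machine-usable, the Python sketch below applies the two replacements quoted above to a sequence. The way the original entry combined r1 and r2 is not reproduced here, so the sketch simply applies each labeled replacement in turn. The function names, the LABELS dictionary and the placeholder wild-type sequence (934 bases of "n" with g at 207 and a at 443) are assumptions made for the example, not data from the entry.

```python
import re

# Matches single-base replacements such as replace(207,"a").
REPLACE = re.compile(r'replace\((\d+),"([acgt])"\)')

# Labeled features, as described in the text: label -> expression.
LABELS = {
    "r1": 'replace(207,"a")',
    "r2": 'replace(443,"c")',
}


def apply_replace(expr, seq):
    """Return seq with the base at the given 1-based position replaced."""
    m = REPLACE.fullmatch(expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    pos, new_base = int(m.group(1)), m.group(2)
    return seq[:pos - 1] + new_base + seq[pos:]


def build_mutant(seq, labels, names):
    """Apply the replacements named by their labels, one after another.

    Each replacement swaps one base for one base, so coordinates do not shift.
    """
    for name in names:
        seq = apply_replace(labels[name], seq)
    return seq


# Placeholder wild type: only positions 207 and 443 carry real bases.
wild_type = "n" * 206 + "g" + "n" * 235 + "a" + "n" * 491
mutant = build_mutant(wild_type, LABELS, ["r1", "r2"])
```

Because the replacement is stated as an expression rather than prose, the mutant sequence falls out of a few lines of code, and the labels give software a stable handle on each change.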
The utility of constructs like this has been realized in my program
GETOB, which is able to evaluate virtually any Feature expression
and return the resultant feature. For example, an expression such as
would retrieve the entire 934bp sequence, modified according to the labeled
Contrast the old construct with the way it is currently
/note="g in AJ23036; a in mutant 1041"
/note="a in AJ23036; c in mutant 1041"
I have several problems with this. First, the expression given
doesn't evaluate to anything new. 207..207 returns the WILD TYPE
allele at that position. Secondly, the mutant sequences have been
split up into two separate features. This is a step backwards from
the original Features philosophy, which sought to put together
previously disconnected things into single coherent features.
Additionally, the previous version made the feature accessible
through a /label= qualifier. /label has a specific meaning: it
is a tag within an entry that serves as a unique handle on that
feature. In contrast, /note= is almost meaningless. This qualifier
is for anything whose sole purpose is as a note to the human
reader. Most importantly, the relegation of the actual data
describing the mutation to a /note means that THAT DATA IS NO
LONGER MACHINE READABLE, because, in contrast to the rigorously-
defined Features expression, the note has no strict definition.
Consequently, you can't write software to access it.
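The fragility is easy to demonstrate. The Python sketch below uses a regular expression written to fit the two /note lines quoted above, and it works only for that exact wording. The pattern, the function name parse_note, the dictionary keys and the reworded note are all hypothetical. Reading AJ23036 as the reference and 1041 as the mutant is my interpretation of the notes, not something the format guarantees.

```python
import re

# Tailored to the wording '<base> in <name>; <base> in mutant <name>'.
NOTE = re.compile(
    r'/note="([acgt]) in ([^\s;]+); ([acgt]) in mutant ([^\s"]+)"'
)


def parse_note(line):
    """Return the mutation described by a /note line, or None if the wording differs."""
    m = NOTE.search(line)
    if m is None:
        return None
    ref_base, ref_name, mut_base, mut_name = m.groups()
    return {
        "reference_base": ref_base,
        "reference_name": ref_name,
        "mutant_base": mut_base,
        "mutant_name": mut_name,
    }


# The two notes from the entry parse as hoped.
first = parse_note('/note="g in AJ23036; a in mutant 1041"')
second = parse_note('/note="a in AJ23036; c in mutant 1041"')

# A hypothetical rewording, just as clear to a human reader, yields nothing.
reworded = parse_note('/note="mutant 1041 carries a instead of g"')
```

Every new phrasing an annotator chooses would need its own pattern, which is exactly what a defined expression syntax avoids.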
How much data has been effectively lost from software-accessibility
because of re-structuring of entries in this fashion? I only
discovered this particular change while testing GETOB, in preparation for
publication of an article describing XYLEM. In effect, such changes
in database presentation, while still consistent with the Features
Definition, amount to fundamental changes in the database. How can
software developers write software with major upheavals such as
these occurring with no prior announcement? (It occurs to me that
I was supposed to be on a developer's mailing list. Do the
databases ever consult with developers? Many of us spend an awful
lot of time thinking about the database, and might have some useful ideas.)
Many people have little appreciation for what an immensely useful
thing the Features language can be, partly because very little software
is yet available for utilizing this language. Over the last few
years, my XYLEM package has tackled exactly that, starting with the
old Features format, and evolving as the new Features format
matured. This evolutionary process, in fact, has been the main
reason that the publication of my paper on XYLEM has been delayed. The
recent additions to the Definition have been very easy to implement, being
logical extensions of the already-existing Definition. For example,
it took only a few hours to add the /codon_start qualifier to GETOB.
In contrast, writing a routine to dissect information from a /note
is a horrendous task at best, and likely to produce an unreliable result.
I implore the database directors to keep moving in the
direction of getting as much of the data as possible into a
software-parsable form. The databases can only reach their true potential when we can
minimize the need for human intervention in manipulating pieces of
data. Finally, don't develop these things in a vacuum. Ask the
software developers, and perhaps even the user community in
general, what they would like the database to be. The bulletin
boards provide a very simple mechanism for sounding out ideas and
getting input from a large number of minds.
Brian Fristensky | Conservation is getting nowhere because it is
Department of Plant Science | incompatible with our Abrahamic concept of
University of Manitoba | land. We abuse land because we regard it as a
Winnipeg, MB R3T 2N2 CANADA | commodity belonging to us. When we see land
frist at ccu.umanitoba.ca | as a community to which we belong, we may
Office phone: 204-474-6085 | begin to use it with love and respect.
FAX: 204-275-5128 | Aldo Leopold, 1948 A SAND COUNTY ALMANAC