Keeping GenBank/EMBL/DDBJ Software-Parseable

frist at frist at
Sun Sep 15 16:20:07 EST 1991

I am not one to complain, as those familiar with my postings will
attest, and I am particularly hesitant to criticize those who
oversee the major databases, which, in my opinion, have
made great progress in recent years with the introduction of the
DDBJ/EMBL/GenBank Feature Table format. However, I perceive a
philosophical shift which to me undermines much of that progress.

When the databases were first released using the new Features
format, this was perhaps the most significant step ever towards
making the data machine-accessible. In place of the
old, difficult-to-parse Features table was, in effect, a fairly
rigorously-defined programming language. For example, an mRNA composed
of three exons could be constructed by evaluating a very simple


The power of this language is that a single parsing
algorithm can create ANY feature! Each expression evaluates to a

I'll repeat that because it is the crux of the matter: "EACH

Thus, the more features that are formulated as expressions, the
more features can be retrieved by software. In other words,
the utility of the database is increased as more features are
encoded into the language.
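
As a rough illustration of the "single parsing algorithm" idea, here is a
minimal Python sketch (not GETOB itself, which is not shown in this post).
It assumes only three constructs from the Feature Table location syntax:
a range a..b, join(...), and complement(...). The function names
split_args and evaluate, and the toy sequence, are hypothetical.

```python
import re

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def split_args(text):
    """Split a comma-separated argument list, ignoring commas inside parentheses."""
    args, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            args.append(text[start:i])
            start = i + 1
    args.append(text[start:])
    return [a.strip() for a in args]


def evaluate(expr, seq):
    """Evaluate a location expression against seq and return the resulting sequence."""
    expr = expr.strip()
    # A range such as 3..7 (1-based, inclusive).
    m = re.fullmatch(r"(\d+)\.\.(\d+)", expr)
    if m:
        return seq[int(m.group(1)) - 1:int(m.group(2))]
    # A single position such as 5.
    if re.fullmatch(r"\d+", expr):
        pos = int(expr)
        return seq[pos - 1:pos]
    # An operator applied to sub-expressions, each evaluated recursively.
    m = re.fullmatch(r"(\w+)\((.*)\)", expr, re.DOTALL)
    if m:
        op = m.group(1)
        parts = [evaluate(arg, seq) for arg in split_args(m.group(2))]
        if op == "join":
            return "".join(parts)
        if op == "complement" and len(parts) == 1:
            return parts[0].translate(COMPLEMENT)[::-1]
    raise ValueError(f"unsupported expression: {expr!r}")


toy = "acgtacgtac"  # made-up sequence, for illustration only
print(evaluate("join(1..3,7..10)", toy))  # -> acggtac
```

Because every sub-expression evaluates to a sequence, the same routine
handles nested expressions such as join(complement(2..5),1..2) with no
special cases.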

The GenBank entry BRLTRPLE (M17892) illustrates the way in which
the utility of the database can be decreased by how annotation is
implemented. When the new Features format was first implemented, a
mutant containing two point mutations was defined in the expression

variation      group(r1,r2)

where r1 and r2 referred to two features, identified by the unique
labels r1 and r2, which contained the expressions

replace(207,"a")    and     replace(443,"c")
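
To make the construct concrete, here is a small Python sketch of how
group(r1,r2) could be resolved through its labels and applied to a
sequence. It is an assumption-laden illustration, not GETOB's code: it
handles only single-position replace() expressions, the helper names
parse_replace and apply_group are hypothetical, and the 934-base sequence
is a placeholder of n's rather than the real BRLTRPLE data.

```python
import re


def parse_replace(expr):
    """Parse replace(POSITION,"BASES") into (position, bases)."""
    m = re.fullmatch(r'replace\((\d+),"([a-zA-Z]+)"\)', expr.strip())
    if m is None:
        raise ValueError(f"not a replace expression: {expr!r}")
    return int(m.group(1)), m.group(2)


def apply_group(group_expr, labeled, seq):
    """Apply every labeled replace() named in group(...) to seq."""
    m = re.fullmatch(r"group\((.*)\)", group_expr.strip())
    if m is None:
        raise ValueError(f"not a group expression: {group_expr!r}")
    bases = list(seq)
    for label in (s.strip() for s in m.group(1).split(",")):
        pos, new = parse_replace(labeled[label])
        bases[pos - 1] = new  # positions are 1-based
    return "".join(bases)


# Labeled features as described in the text; the sequence is a placeholder.
labeled = {"r1": 'replace(207,"a")', "r2": 'replace(443,"c")'}
wild_type = "n" * 934
mutant = apply_group("group(r1,r2)", labeled, wild_type)
print(mutant[206], mutant[442], len(mutant))  # -> a c 934
```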

The utility of constructs like this has been realized in my program
GETOB, which is able to evaluate virtually any Feature expression
and return the resultant feature. For example, an expression such as

would retrieve the entire 934bp sequence, modified according to the labeled

Contrast the old construct with the way it is currently

     variation       207..207
                     /note="g in AJ23036; a in mutant 1041"
     variation       443..443
                     /note="a in AJ23036; c in mutant 1041"

I have several problems with this. First, the expression given
doesn't evaluate to anything new. 207..207 returns the WILD TYPE
allele at that position. Second, the mutant sequences have been
split up into two separate features. This is a step backwards from
the original Features philosophy, which sought to put together
previously disconnected things into single coherent features.
Additionally, the previous version made the feature accessible
through a /label= qualifier. /label has a specific meaning: it
is a tag within an entry that serves as a unique handle on that
feature. In contrast, /note= is almost meaningless. This qualifier
is for anything whose sole purpose is as a note to the human
reader. Most importantly, the relegation of the actual data
describing the mutation to a /note means that THAT DATA IS NO
LONGER MACHINE READABLE, because, in contrast to the rigorously-
defined Features expression, the note has no strict definition.
Consequently, you can't write software to access it.

How much data has effectively been made inaccessible to software
by the restructuring of entries in this fashion? I only
discovered this particular change while testing GETOB, in preparation for
publication of an article describing XYLEM. In effect, such changes
in database presentation, while still consistent with the Features
Definition, amount to fundamental changes in the database. How can
software developers write software with major upheavals such as
these occurring with no prior announcement? (It occurs to me that
I was supposed to be on a developer's mailing list. Do the
databases ever consult with developers? Many of us spend an awful
lot of time thinking about the database, and might have some useful

Many people have little appreciation for what an immensely useful
thing the Features language can be, partly because very little software
is yet available for utilizing this language. Over the last few
years, my XYLEM package has tackled exactly that, starting with the
old Features format, and evolving as the new Features format
matured. This evolutionary process, in fact, has been the main
reason that the publication of my paper on XYLEM has been delayed. The
recent additions to the Definition have been very easy to implement, being
logical extensions of the already-existing Definition. For example, 
it took only a few hours to add the /codon_start qualifier to GETOB.
In contrast, writing a routine to dissect information from a /note
is a horrendous task at best, and likely to produce an unreliable result.
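
To show why, here is a hedged Python sketch of the kind of routine one
would be forced to write. The pattern is a guess at the wording of the
two notes in BRLTRPLE and nothing more; the names NOTE_PATTERN and
parse_note are hypothetical, and the reworded note in the second call is
invented to show how easily such a parser breaks.

```python
import re

# Matches only notes worded exactly like: "g in AJ23036; a in mutant 1041"
NOTE_PATTERN = re.compile(r"^(\w) in (\S+); (\w) in mutant (\S+)$")


def parse_note(note):
    """Return the mutation described by a /note, or None if the wording differs."""
    m = NOTE_PATTERN.match(note)
    if m is None:
        return None
    wild_base, strain, mutant_base, mutant = m.groups()
    return {
        "wild_base": wild_base,
        "strain": strain,
        "mutant_base": mutant_base,
        "mutant": mutant,
    }


# The wording used in the current entry parses.
print(parse_note("g in AJ23036; a in mutant 1041"))
# An equally readable (invented) rewording does not: the result is None.
print(parse_note("a in mutant 1041, g in wild type AJ23036"))
```

Nothing in the Definition constrains the text of a note, so every new
phrasing would need yet another pattern.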

I implore the database directors to keep moving in the
direction of getting as much of the data as possible into a
software-parsable form. The databases can only reach their true potential when we can
minimize the need for human intervention in manipulating pieces of
data. Finally, don't develop these things in a vacuum. Ask the
software developers, and perhaps even the user community in
general, what they would like the database to be. The bulletin
boards provide a very simple mechanism for sounding out ideas and
getting input from a large number of minds.

Brian Fristensky                | Conservation is getting nowhere because it is
Department of Plant Science     | incompatible with our Abrahamic concept of
University of Manitoba          | land. We abuse land because we regard it as a
Winnipeg, MB R3T 2N2  CANADA    | commodity belonging to us. When we see land 
frist at          | as a community to which we belong, we may  
Office phone:   204-474-6085    | begin to use it with love and respect.
FAX:            204-275-5128    | Aldo Leopold, 1948 A SAND COUNTY ALMANAC
