GenBank Errors

Tom Schneider toms at fcs260c2.ncifcrf.gov
Wed Oct 23 15:34:29 EST 1991


In article <9110212016.AA01487 at motif.cshl.org> kumar at CSHL.ORG
(Sanjay Kumar at Cold Spring Harbor Lab) writes:

> As Paul (pgil at histone.lanl.gov) pointed out:
>> However, with the advent of automated submission tools that are
>> now a reality we can begin to capitalise on this by providing
>> that same level of data integrity checking AT THE LAB BENCH.
...
>If software
>was available for identifying possible errors in data *PRIOR* to submission,
>the database curators would be helped.
>potential errors could and should be rechecked by the scientist.  This is
>no different than with any other data that goes into a paper.  Such a data
>validation program should be able to flag possible sequencing errors, 
>inclusion of vector sequence, as well as inconsistencies relating to merging
>entries.  A server-based program would be valuable.  

Interesting idea.  I like the idea that if the user is asked for some key
words, the program could go ask IRX for other relevant entries.  The scientist
might realize, for example, that they have just filled in a sequence between
two other ones, and therefore could indicate that their sequence should be
merged with the adjacent ones.  Thus we would be all actively building the
great structure.

>Tom Schneider (toms at ncifcrf.gov) wrote:
>> I've been suggesting solutions, such as named objects and merged entries, for
>> 10 years. 
>How about *detailing* those suggestions here so everyone could evaluate the 
>proposed solutions?  

Perhaps a little history is also appropriate on my part.  In 1979, before
GenBank was born, I started a project to search for Shine/Dalgarno like
sequences in ribosome binding sites.  I typed in the sequences around the start
of 63 genes into a simple file.  As I was doing this I realized that if I
wanted to look at a different range around the starts, I'd have to do a lot of
editing.  I was likely to make errors, and mess up.  So I embarked on a project
to allow me to type in the entire known sequence and then to extract the
fragments I desired.  We (Jeff Haemer, Gary Storm and I in Larry Gold's lab in
Boulder Colorado) called the collection of sequences a 'library', and so the
extracted fragments were named a 'book'.  Clearly there had to be a librarian
program, and "her" name became Delila, since I woke up one day and wrote down:

  DEoxyribonucleic-acid
    LIbrary
      LAnguage

With Delila, one writes down the fragments one wants to study, and "she"
extracts them.  In keeping with good programming practice, which says that
---like UNIX--- it is best to make small tools that do their job well, this is
all Delila should do.  Other "auxiliary" programs then process the data in
various ways.

Delila raised lots of interesting issues about sequence databases.  To my
surprise, they are just as relevant now as they were 10 years ago.

Some of the issues:
  DEFINITION
  NAMES
  MERGES
  COORDINATES
  DATABASE STRUCTURE

DEFINITION: 
I spent about 6 months designing the database, BEFORE I wrote any code.  The
final document is called LibDef (Library Definition), and like all the code, it
is available by anonymous ftp from ncifcrf.gov in pub/delila.  I did this
because I knew that if I just leapt right in and started programming I'd have a
mess.  (I should say that I do not imagine that Delila should replace GenBank.
Delila has its own problems.  For example, I designed it before I was aware of
introns, and I didn't know they would be in almost every eukaryotic gene, so
Delila as it stands cannot handle these.  It was never designed or defined to.
Rather, I use Delila as an example for thinking about the problems of
databases.)

I stuck to a rigorous discipline, which said that if I wanted to change
anything in the structure of the database, I would first make a change to
LibDef, then change Delila and the libraries.  This allowed me to keep track of
what I wanted and keep the documentation up to date.  It also forced me to
think hard about the implications of each change.

GenBank has some documents about the database.  However, so far as I know,
there is no document that completely defines what is to be stored in GenBank,
and how it is to be organized.  GenBank just is.  I believe, and have said
repeatedly over the years, that we would benefit by having a Definition of
GenBank.  We could then all argue over this document and from this would emerge
a better definition.  It should always be flexible, but obviously there has to
be a process for change, and probably only one person, or a few people, who
actually make the changes.

Since so many people use it, the definition should also contain the REASONS for
the design choices, so that other people can support or object to them.  This
is the philosophy of the database, and it is very important for making
progress.

You will see that LibDef contains a Backus-Naur Form (BNF) for the database
structure and for the Delila language.  These define the syntax of the
database.  Recently several people have attempted to parse the GenBank
structure, and ran into trouble because the structure did not adhere to a
definition.  If people can't write code that reads cleanly through the
structure, then we can't get at the data in there.  I suppose that those
particular problems were removed, but they would never have occurred if there
had been a definition.  Also, these incidents are no guarantee that the next
parser will work, or even that the parsers won't suddenly break when someone at
GenBank makes an innocent change that is not backed up by a definition.
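
To make the point concrete, here is a minimal sketch in Python (not Delila, and
not the BNF from LibDef).  The three-rule grammar in the comment is an
assumption written for this illustration; it covers only instructions of the
form "get from 458 -50 to 458 +50;".  The function name parse_get is likewise
hypothetical.  Once the grammar is written down, a parser can accept or reject
input against it rather than guessing.

```python
import re

# Hypothetical grammar (an assumption for illustration, NOT the LibDef BNF):
#   get      ::= "get" "from" position "to" position ";"
#   position ::= integer [ offset ]
#   offset   ::= ("+" | "-") digits
GET = re.compile(
    r"get\s+from\s+(-?\d+)(?:\s+([+-]\d+))?"
    r"\s+to\s+(-?\d+)(?:\s+([+-]\d+))?\s*;"
)


def parse_get(text):
    """Return the (start, end) coordinates named by one get instruction.

    Anything that does not match the grammar above raises ValueError.
    """
    m = GET.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not a valid get instruction: {text!r}")
    base1, off1, base2, off2 = m.groups()
    # A missing offset counts as zero; int() accepts a leading sign.
    start = int(base1) + int(off1 or 0)
    end = int(base2) + int(off2 or 0)
    return start, end
```

Anything outside the grammar is an error by definition, so a change to the
format has to show up as a change to the grammar first.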

As another example, there is no document that says that every genetic name WILL
be in the database when known, and recorded as such.  Thus LacZ is recorded as
a gene, but LexA is not.  Besides being inconsistent, this means that I can't
reliably write instructions to get genes from the database.  It is amazing to
me that such fundamental information is not being recorded.

We need a formal and complete definition of the database.

NAMES:
It took me a number of years as GenBank advisor to convince people that there
should be the ability to store names of objects in the database.  We now have
the ability, but since there is no definition, there is no commitment or
requirement to actually use it, as the LacZ/LexA example shows.

What does this mean?  A high school student, Mike Stephens, joined me a few
years ago.  Because of the duplications in the database, he worked for about 8
months to create a clean list of GenBank entries with splice junctions in them.
The final product is a file of Delila instructions that begin like this:

   title "IVS beginning from -50 to 50";
   organism Homo.sapiens; chromosome Homo.sapiens;

   piece HUMA1ACMB;
   get from 458 -50 to 458 +50;
   
   piece HUMA1AR1;
   get from 814 -50 to 814 +50;

The title defines the name of the book Delila will create.  The next line
defines the species.  Delila has to know the chromosome, but it is possible
that if the genetic names were used, this could be omitted in this case.
("chromosome" is currently used to distinguish between the main chromosome of
E. coli and elements like F or plasmid or transposons.)

The piece of DNA is then defined.  Since we use GenBank data, this is the LOCUS
name.  The instruction to get from 50 bases before base 458 to 50 bases after
458 is then written.  If GenBank always had a gene name, we could have written:

    gene ACMB;

The LOCUS name changes at the whim and needs of GenBank.  Thus our instruction
set has already fallen out of date!  We looked again this summer and found that
we would have to spend many months repeating the same effort we made before.
If names in GenBank were the standard genetic ones, we would have a much more
stable instruction set.  Yes, it would change in time.  But not anywhere nearly
as fast.

For example:

title "lacZ gene";
organism E.coli; chromosome E.coli;
gene lacZ;
get from gene begin to gene end;

should work 100 years from now!  Accession numbers are horrible to remember.  As
biologists, we should be able to reach into the database the way we think about
it.  We do NOT care about the history of the sequencing efforts.  (Note that as
I said elsewhere, we do need that information also.)

So the idea is that every genetic object in the database should have its
standard genetic name.  This should be part of the definition of GenBank, so it
is there for everybody to see and either agree or disagree with.  We can have
synonyms if you want, but there should always be one primary standard name.

DATABASE STRUCTURE
One point which Jeff Haemer (jsh at ico.isc.com) and I realized early on is that
the output of Delila should be identical in form to its input: the book looks
like the library.  This meant that a program which works on a small subset of
the database will work just fine on the entire database.  One implication is
that the main database should have a coordinate system.  An obvious way to
handle this: if no explicit coordinate system is given, it is assumed to be
L(1 n).

The advantage of extraction is that the analysis programs need only deal with a
small amount of data.  There are issues about how one may want extractions from
a relational database.  Should it be relational also?  Maybe sometimes.  Should
it still have links to the big database?  These are issues for the future.

When sequences are merged, as discussed below, the names must not be
ambiguous.  Fortunately geneticists are already making them consistent across
each organism.  As things are now in GenBank, a merge would leave duplicate
names.  Delila is sensitive to things like this.  Years ago, we extracted a
subset of our database which represented mRNA.  We then made this a library and
extracted the ribosome binding sites.  This guaranteed that each ribosome
binding site sequence is from mRNA, which was important for training the
Perceptron (a primitive neural net).

But if several sequences are extracted from the same piece, they all inherit
the same name, and the library has duplicate names.  The catal program is used
to add numbers to the ends of names to make them unique.  This is a difficult,
only partially solved problem.
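
A minimal sketch of the renaming idea, in Python.  This is not the catal
program; the function name make_unique and the ".1", ".2" suffix format are
assumptions made for illustration.  It appends a number to every name that
occurs more than once and skips any candidate that is already in use.

```python
from collections import Counter


def make_unique(names):
    """Append .1, .2, ... to every name that occurs more than once."""
    totals = Counter(names)   # how often each name occurs overall
    seen = Counter()          # last suffix number tried for each name
    taken = set(names)        # names that must not be reused
    result = []
    for name in names:
        if totals[name] == 1:
            result.append(name)
            continue
        # Step past suffixes that would collide with an existing name.
        while True:
            seen[name] += 1
            candidate = f"{name}.{seen[name]}"
            if candidate not in taken:
                break
        taken.add(candidate)
        result.append(candidate)
    return result


print(make_unique(["lacZ", "lacZ", "lexA"]))
```

The sketch also hints at why the problem is only partly solved: the numbers
depend on the order of extraction, so the same fragment can receive a different
name in a different book.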

The point of this section is that the issues of database structure, use, names
and merges are all intertwined.  How we design a database strongly influences
what we can do with it.  If the design is not sufficient, then we simply cannot
do certain things.  Some design changes open enormous possibilities without any
hindrance to others.  Merging is an example of this.

MERGES

One can put on different hats when one does things.  As a biologist, I want the
complete known sequence of Tn5, so I can get on with designing a PCR primer for
our research.  As a historian I would want the original papers.  As a computer
scientist, I would be concerned that both views are possible.  The viewpoint of
the biologist is not being supported by GenBank!  Who is the data base for
anyway?  :-)  Clearly we need both views, and so it becomes a technical
problem how this is implemented.  Two methods are possible:

1. Biologist:  Physically merge the data and keep a careful record that allows
one to reconstruct the original data.

2. Historian:  Keep the data separate and make a careful record that allows one
to construct the merged data.

These both have consequences.  The Library of Medicine is interested in the
second method, as would make sense for a library.  For retrospective analysis,
as one person pointed out, one will want the original data.  The question is,
which way would the data be used most frequently?  I claim that the Biologist's
approach will far outweigh the Historian's, and that as time goes on, fewer
people will care about the history, just as we rarely care to recall who
discovered a gene or named a species.

Technically, there is a difference.  If the Historian's approach is taken, then
to merge the entire E. coli genome will require quite a bit of computer time,
each time it is wanted.  The alternative is to generate a duplicate, which
means that if someone forgets to update the duplicate, the database falls out
of date or duplications have to be done continuously...  (Or worse, changes
get made on the duplicate, and lost on the next automatic update!)

With the Biologist's approach, most uses of the database are immediately
available, with no computation.  To get an original sequence, you run (an
updated version of) Delila.  I therefore advocate the Biologist's solution.
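
The Biologist's approach can be sketched as follows, in Python.  This is an
illustration under stated assumptions, not a description of GenBank or Delila:
the names Origin, merge and original are hypothetical, offsets are 0-based, and
overlapping entries are assumed to agree base for base.  The merged sequence is
what gets stored; the record kept beside it is enough to recover each original
entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Origin:
    entry: str    # name of the original entry (hypothetical)
    start: int    # 0-based offset of the entry in the merged sequence
    length: int   # number of bases the entry contributed


def merge(entries):
    """Merge (name, sequence, offset) entries into one sequence plus a record.

    Overlaps must agree; a disagreement or a gap raises ValueError.
    """
    total = max(off + len(seq) for _, seq, off in entries)
    merged = [None] * total
    origins = []
    for name, seq, off in entries:
        for i, base in enumerate(seq):
            if merged[off + i] not in (None, base):
                raise ValueError(f"{name} conflicts at position {off + i}")
            merged[off + i] = base
        origins.append(Origin(name, off, len(seq)))
    if None in merged:
        raise ValueError("entries leave a gap")
    return "".join(merged), origins


def original(merged, origin):
    """Reconstruct one original entry from the merged sequence."""
    return merged[origin.start:origin.start + origin.length]


merged, origins = merge([("A", "ACGTAC", 0), ("B", "TACGGA", 3)])
print(merged, original(merged, origins[1]))
```

A real record would also have to note corrections, since an original entry
that disagreed with the merged sequence could not be rebuilt from offsets
alone.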

There is another consequence of the Historian's approach.  If the software is
not done perfectly, inconsistencies could float around in the database.  For
example, I might ask for a view of Tn5 and get one sequence, but because of a
bug in the code, get a different one by asking a different way.  (Example:  two
different merge programs might exist and not do the same thing.)  The
Biologist's solution avoids this entirely by keeping the data in the form it is
going to be used most often.  Notice that both approaches require a commitment
to do the merges.  I fear that if the Historian's approach is taken, this
commitment will be allowed to slide, and the Biologist will be unable to work
efficiently.  This is indeed the situation today.

COORDINATES

Suppose I have extracted LacZ using the Delila instructions given above.
Unlike GenBank entries, Delila databases were designed for computer access, not
human readability, so I run the Lister program to make a nice listing of it.
(Readability is yet another issue...)  Now suppose I extract the region around
the ribosome binding site:

title "lacZ gene start";
organism E.coli; chromosome E.coli;
gene lacZ;
get from gene begin -60 to gene begin +40;

This is a realistic Delila instruction that works today.

I run the lister program on this and get a numbered listing.  Now I want to
compare the two listings.  If you have ever done something like this you will
know how frustrating it can be to keep subtracting and adding to compare the
sequences.  To avoid this, I defined Delila to keep the original coordinates
of fragments.  This turns out to be pretty tricky to do because Delila allows
one to get the complementary sequence:

get from gene begin +40 to gene begin -60 direction complement;

Also, some sequences are circular, so the numbering can have funny jumps.
These problems were solved and are implemented in Delila.  So the output from
ANY auxiliary program can be directly compared to the output from any other and
there are no headaches.
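
A minimal sketch, in Python, of a fragment that carries its original
coordinates.  It is not Delila's implementation: the function name extract and
its parameters are assumptions, only linear pieces are handled (no circular
numbering), and the complement is read from the higher coordinate down to the
lower one, as in the instruction above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def extract(sequence, first_coord, start, end, complement=False):
    """Return [(coordinate, base), ...] for start..end inclusive.

    `sequence` is a linear piece whose first base is numbered `first_coord`.
    With complement=True the fragment runs from `start` DOWN to `end`
    and the bases are complemented.
    """
    step = -1 if complement else 1
    coords = range(start, end + step, step)
    last_coord = first_coord + len(sequence) - 1
    in_piece = all(first_coord <= c <= last_coord for c in (start, end))
    if not coords or not in_piece:
        raise ValueError("requested range is empty or outside the piece")
    strand = sequence.translate(COMPLEMENT) if complement else sequence
    # Each base keeps the number it had in the piece it came from.
    return [(c, strand[c - first_coord]) for c in coords]


piece = "ACGTTGCA"  # made-up piece, numbered 101 to 108
print(extract(piece, 101, 103, 105))
print(extract(piece, 101, 105, 103, complement=True))
```

Because both listings are numbered in the coordinates of the piece, base 104 in
one is base 104 in the other, and no arithmetic is needed to line them up.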

Because GenBank only stores sequences, and does not use subfragments of them,
it is natural for them to number the sequences 1 to n.  If two sequences are
merged, then the merged sequence is renumbered 1 to n, which is going to cause
some heads to hurt!  Further, Mike's Delila instructions were FORCED to be
ABSOLUTE, i.e., of the form:

   get from 458 -50 to 458 +50;

because we could not use names from which RELATIVE instructions could be
written (as in the lacZ example)!  So if two entries are merged, or a sequence
correction is made - deletion or insertion - then AN ABSOLUTE INSTRUCTION IS WRECKED!
Clearly it is not possible to maintain the same numbering forever because
sequences merge.  But within a short time span, sequence numbering is the
easiest way to get around, so we need it.

Let me repeat:

Because GenBank does not have names, we are forced to use absolute
coordinates.  Absolute coordinates are unstable.  Instructions which we write
are therefore unstable, and our hard earned effort is lost over time.  Worse,
we cannot usefully give the list to other people, because it won't work for
them either.
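
The instability can be shown with a toy example in Python.  Everything here is
made up for illustration: the sequence, the gene name "demo", and the functions
get_absolute and get_relative.  The sketch also assumes that the database moves
a gene's recorded start when bases are inserted upstream of it.

```python
def get_absolute(sequence, center, before, after):
    """'get from center -before to center +after', 1-based coordinates."""
    start = center - before
    end = center + after
    if start < 1 or end > len(sequence):
        raise ValueError("request runs off the end of the piece")
    return sequence[start - 1:end]


def get_relative(sequence, genes, name, before, after):
    """The same request, anchored on a named gene start, not a number."""
    return get_absolute(sequence, genes[name], before, after)


# Made-up entry: the gene "demo" starts at base 17 (the A of ATG).
old = "TTTTTTAGGAGGTTTTATGACCATG"
genes_old = {"demo": 17}

# A correction inserts two bases upstream; the annotation moves with it.
new = "GG" + old
genes_new = {"demo": 19}

wanted = get_absolute(old, 17, 3, 2)                 # "TTTATG"
stale = get_absolute(new, 17, 3, 2)                  # "GTTTTA"
stable = get_relative(new, genes_new, "demo", 3, 2)  # "TTTATG"
print(wanted, stale, stable)
```

The absolute instruction silently returns the wrong bases after the
correction, while the named instruction still returns the region around the
gene start.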

When Delila extracts sequences, they have the same numbering as the original
database.  This means that there has to be a mechanism for defining the
numbering of a fragment.  This does not exist in GenBank.  Implementation is
not so hard though: when Delila runs, a coordinate datatype can be added.

The way coordinates are implemented by Delila today is not good because it
cannot handle insertions and deletions.  Consider, for example, that one would
like to extract the mRNA of lacZ with a particular deletion for the purpose of
studying the potential RNA structures of that mutation.  We don't want to do
this by hand because we would make mistakes, and besides, we plan to do 500
such studies...  An advanced Delila, working with GenBank data, would allow one
to say something like:

title "lacZ mRNA with mutation xyz";
organism E.coli; chromosome E.coli;
gene lacZ;
get all transcript with mutation xyz;

To stick to the idea that all output should have consistent numbering
(to, for example, compare the mutant sequence to the original!) we need
a scheme that can handle deleted sequences.  Here is one:

L(5 50)(200 5000) (8 -20)

It means that there is a Linear (L) sequence (could be Circular or perhaps
Repeat), the first base is 5, and runs through 50, followed by base 200 through
5000, followed by bases 8 to -20 in decreasing order.
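
As a sketch of how such a description could be read by a program, here is a
small Python function.  The name expand is hypothetical, and two details are
assumptions not settled by the text above: that coordinates are consecutive
integers including 0, and that the leading letter is a single character (only L
is spelled out above).

```python
import re

SPEC = re.compile(r"\s*([A-Za-z])\s*((?:\(\s*-?\d+\s+-?\d+\s*\)\s*)+)")
SEGMENT = re.compile(r"\(\s*(-?\d+)\s+(-?\d+)\s*\)")


def expand(spec):
    """Expand 'L(5 50)(200 5000) (8 -20)' into (topology, coordinates).

    Each (a b) segment contributes a..b inclusive, counting down when a > b.
    Assumption: coordinates are consecutive integers, 0 included.
    """
    m = SPEC.fullmatch(spec)
    if m is None:
        raise ValueError(f"bad coordinate description: {spec!r}")
    topology, body = m.groups()
    coords = []
    for a, b in SEGMENT.findall(body):
        a, b = int(a), int(b)
        step = 1 if a <= b else -1
        coords.extend(range(a, b + step, step))
    return topology, coords


topology, coords = expand("L(5 50)(200 5000) (8 -20)")
print(topology, len(coords), coords[0], coords[-1])
```

With the numbering expanded this way, each base of a fragment with a deletion
still carries the number it had in the original, so the mutant and the original
can be compared position by position.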

This could be implemented today if someone has the time and energy.

To summarize, the issues of names and merging are tangled with the
possibilities of using computer languages like Delila and its progeny such as
DNA STAR (TM?, by Fred Blattner) to manipulate the sequence database.  To
support these languages the database must have certain properties.  As the
database grows, it will become critical to have access through languages
because we won't be able to deal with the data any other way.

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms at ncifcrf.gov


