Genbank Errors

Tom Schneider toms at fcs260c2.ncifcrf.gov
Thu Oct 24 18:00:15 EST 1991


In article <9110192347.AA21764 at genbank.bio.net> Peter.Rice at EMBL-HEIDELBERG.DE
(Peter Rice) writes:

Your posting only arrived today, October 24, 6 days later.

>2. Tom Schneider's original posting reported several overlaps between
>   published sequences (note: between published sequences, each of which
>   is represented as a separate entry in "GenBank"). The sequence databases
>   are not, and cannot be, responsible for the quality of published sequence
>   data, and could also argue a duty to report faithfully what is in the
>   original refereed publications.

They are, of course, not responsible for the published data.

>3. When citing entries in the sequence databases, *please* refer to them by
>   accession number though you can use the name too. I realize that most
>   biologists are happier to refer to the entry by a "meaningful name" but
>   the names can and often do change from time to time. Suppose, for
>   example, that a new naming standard changed the name of Tn5 or that
>   it was agreed to call the gene "kanamycin resistance" rather than
>   "neomycin resistance".

In the V00615 keyword listing is 'kanamycin resistance'.  It is in
the paper which describes the plasmid (Gene 7: 79, 1979), and we use
kanamycin on our plates.  Also, Sigma sells the kanamycin separately
from neomycin.  Are they the same?  Apparently not: neomycin is
C_23_H_46_N_6_O_13.#H_2_SO_4, kanamycin is C_18_H_36_N_4_O_11.H_2_SO_4.

>4. These Tn5 sequences have been reported at various times by several
>   different laboratories. Even when sequenced in the same laboratory
>   you cannot be certain that the sequences are from the same clone
>   or even the same original Tn5 isolate. They belong in separate entries for
>   many purposes. If you need a "merged entry" you should do it yourself
>   and draw your own conclusions about any conflicts, or use a specialist
>   database.

No, there should be a merged entry (or a view of the database if you like
historical sequence storage methods) which lists the conflicts and strain
differences.  This way the effort of making merges is not repeated by everyone
who merely wants a functional sequence.  Furthermore, when I want to do a
statistical analysis, I would have to do thousands of merges, and after a
certain point it will become impossible to do statistics on the database
because the merge task will become overwhelming.  Those of us who hop across
the database to do studies are being hindered not by the mass of data but by
the way it is being stored.  I do not worry at all that the human genome
project will flood us with data, I rather look forward to it!  I do worry that
it will be such a mess that the task of studying it will become impossible.

>5. The overlaps *are* described - in entry X01702 (TRNTN5STR in the posting,
>   but a different name in EMBL and DDBJ). The entry overlaps with
>   V00617 (TRN5IR2) and V00618 (ECOTN5X).

Of course they are described, but vaguely, not precisely.  I still had to do
the footwork to put the thing together.  There is no way to run a program to
get the merge.  So they could have been merged in the database, or a notation
could have been made as to how the entries should be merged, but that effort is
left to everyone who comes along.  It's equivalent to the situation 10 years
ago when everybody had to type the sequences in themselves.  The current policy
is causing redundant effort.
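
To make "a notation as to how the entries should be merged" concrete, here is
a minimal sketch, assuming the notation is nothing more than the offset at
which one entry starts within the other.  The function name and the toy
sequences are hypothetical; this is not an existing GenBank/EMBL/DDBJ tool or
format.

```python
def merge_entries(left, right, offset):
    """Merge `right` onto `left`, where right[0] aligns with left[offset].

    Hypothetical helper for illustration only.  Returns a tuple
    (merged_sequence, conflicts).  Each conflict is
    (position, left_base, right_base), with 0-based positions in the merge.
    Where the entries disagree, the base from `left` is kept and the
    disagreement is reported rather than silently resolved.
    """
    if not 0 <= offset <= len(left):
        raise ValueError("offset must fall within, or at the end of, left")
    merged = list(left)
    conflicts = []
    for i, base in enumerate(right):
        pos = offset + i
        if pos < len(merged):
            # Inside the overlap: record any disagreement.
            if merged[pos] != base:
                conflicts.append((pos, merged[pos], base))
        else:
            # Past the end of `left`: extend the merged sequence.
            merged.append(base)
    return "".join(merged), conflicts


# Toy sequences, not real Tn5 data.
merged, conflicts = merge_entries("ACGTACGT", "ACGAGG", 4)
print(merged)     # ACGTACGTGG
print(conflicts)  # [(7, 'T', 'A')]
```

With the offset recorded once in the entry, anyone could rerun the merge and
see the same list of conflicts, instead of redoing the footwork by hand.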

>8. As one of the collaborators in the ECD database, I do not want to see
>   merged entries in the database.

We need BOTH viewpoints.  Don't make the viewpoint you want prevent me from
using the data.  Computer science allows all views, but we have to be sure to
support them.

>Check the original papers!!!

But of course.  I know just as well as you how to resolve these things.  The
point is that here is (yet another) mess in the database which could have been
resolved a while ago and it wasn't.  Priorities have been to get pure sequence
in quickly, and that has been done well.  Now we must go on and get the
annotation straightened out.  THAT was what prompted me to post; we are far
from done on this job.

>>By joining these entries, the entire sequence of the tn5 transposon would be in
>>the database, and anyone wanting it would just grab it.

>But what is "the entire sequence of tn5" ??? How many copies of Tn5 are there
>in E. coli, or in other enteric bacteria, or wherever? What differences are
>there between them? What are the phenotypic effects of these differences?
>What is the sequence specificity of the insertion sites (i.e. the various
>sequences flanking Tn5)? All of these are important biological questions,
>and if GenBank/EMBL/DDBJ is to be used then it *must* reflect these
>differences.

Absolutely.  I suggest that a single sequence be stored, or that a standard
sequence view be made.  Then all the strain differences should be stored,
but probably in a compressed form.  Detailed notation in the database
would allow one to obtain any of the strains known.  At the point that
we can sequence all of E. coli, would you advocate storing an entire copy
of it for each known mutation in lacZ?  But what if the way I do my
work is to make mutations, and then sequence the entire genome?  10 years
from now this may be quite reasonable, and lots easier than cloning!
So do you advocate storing all that?  Surely we should store just the
changes.
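
As a rough illustration of "store just the changes", the sketch below keeps
one reference sequence and rebuilds a strain from a short list of
differences.  The (position, reference_segment, replacement) notation and the
function name are assumptions made for this example, not an existing database
format.

```python
def apply_differences(reference, differences):
    """Rebuild a strain sequence from a reference plus its differences.

    Hypothetical notation: each difference is
    (position, reference_segment, replacement), with 0-based positions in
    reference coordinates.  An empty replacement is a deletion; an empty
    reference_segment is an insertion before `position`.
    """
    pieces = []
    cursor = 0
    for pos, ref_seg, alt_seg in sorted(differences):
        if pos < cursor:
            raise ValueError("differences overlap")
        if reference[pos:pos + len(ref_seg)] != ref_seg:
            raise ValueError(f"reference does not contain {ref_seg!r} at {pos}")
        pieces.append(reference[cursor:pos])  # unchanged stretch
        pieces.append(alt_seg)                # strain-specific stretch
        cursor = pos + len(ref_seg)
    pieces.append(reference[cursor:])
    return "".join(pieces)


# Toy data, not a real lacZ sequence.
reference = "ATGACCATGATTACG"
strains = {
    "point_mutant": [(3, "A", "G")],
    "deletion_mutant": [(6, "ATG", "")],
    "insertion_mutant": [(9, "", "CC")],
}
for name, diffs in strains.items():
    print(name, apply_differences(reference, diffs))
```

Each strain costs only a few tuples of storage, yet the full sequence of any
strain can be regenerated on demand.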

>Even if you were to put the sequences together yourself, and publish them
>with a paper on the biological properties of Tn5, you may not have the
>sequence correct. As an example, a paper this year reported a massive
>sequence merge in E. coli around the two minute region, but one of
>the joins was at a restriction site where there remains the possibility
>that the two sites are not the same. Until someone sequences across the
>join, the merge remains "probably correct, but not proven".

If there are uncertainties in a merge, they should be appropriately stored in
the database.

>Fortunately the database entries are far more accurate than this example
>of hasty typing.

touche' :-)

>>Comparing TRN5NEO to the identically sequenced ECOTN5X, we see that some in one
>>case things are in features, the other, still in comments.  Inconsistent.

>In this particular case you will note from the entries that one came from
>GenBank and the other from EMBL in the days (a 1981 paper) before the close
>collaboration was necessary.

You mean this problem has been sitting around all these 10 years?  Wow.

>I know of many cases where over 20 closely similar sequences
>have been published in a single paper, and all have to be separate entries.

This seems to imply that the current database structure and/or entry mechanisms
cannot handle strain differences efficiently, and we are headed for a real
flood.  Isn't there a mechanism for listing the 50 differences between two
strains?  I'm pretty sure the new features table can (or could be made to) do
this.
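
Producing such a list is not hard once the two sequences are aligned.  A
minimal sketch, assuming the strains have already been aligned to the same
length (with '-' standing for a gap); the function name is hypothetical and
the features table syntax itself is not modeled here:

```python
def list_differences(strain_a, strain_b):
    """List the positions at which two aligned sequences differ.

    Hypothetical helper.  Both sequences must already be aligned to the
    same length.  Returns (position, base_in_a, base_in_b) tuples with
    1-based positions, as sequence entries are usually numbered.
    """
    if len(strain_a) != len(strain_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(strain_a, strain_b))
        if a != b
    ]


# Toy sequences for illustration.
print(list_differences("ATGACCATGATT", "ATGGCCATGCTT"))
# [(4, 'A', 'G'), (10, 'A', 'C')]
```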

>The gene names for E. coli, as it happens, are indeed being standardized.

YEA!

>So are the names for several other species. But where are the specialist curators
>who will do this for the other species?

Perhaps we need to widen the call for people to help us.

>Ah, a realization that it is international anyway.

Sorry, I use the single word GenBank to refer to the international database.
We need a single word to do this, so that we don't have to keep listing
everybody involved.

>>GenBank is:
>>  INCONSISTENT
>The scientific literature is inconsistent.

Yes.  But we should not leave it that way if we can help it.

>>  REDUNDANT
>
>Yes, and a good thing it is!!! If a sequence is determined ten times then it
>should be in the database ten times.

No, as I pointed out above, I think there should be some compression and the
ability to obtain any strain one wants through appropriate data in the entry
and software.  For those of us who want to do statistics across many genes
(e.g., all ribosome binding sites), having 10 copies of one thing and 2 of the
next hinders making the collection.

Also, as a general principle, it is not wise to have duplications or redundancy
in any kind of database, without tight links between them.  The reason is that
after a while corrections will be made to one but not the other (or the other
19!! wow!), and this leads to inconsistency.  I'm sure you would not appreciate
it if an airline changed your flight schedule for you, but when you got there
you found that the staff at the airport had not been alerted, because the
change was recorded in one database but not propagated to another.  "Sorry,
it's not our fault, the computer did it."  I can imagine medical disasters
occurring 10 years from now for similar reasons.

So from a purely computational viewpoint, we should reduce the redundancy
in the database.  Understand this carefully!  I did NOT say to repress
strain differences; they also have to be provided for!  BOTH requirements
must be met.

>>  FULL OF ERRORS
>
>No!!! If it were full of errors it would be random sequence.

What I mean is that it is easy to find errors in an entry, and I demonstrated
it by finding a bunch more on the fly!  The sequence itself is probably quite
good, as the identity in Tn5 showed.  However, one possible resolution of the
case of the duplicate PBR322 entries could be sequence entry error.

>Of the "errors"
>that are there, sequencing errors are the responsibility of the original
>laboratory (including entries that have M13 left in). Differences between
>"overlapping" entries are interesting, and the opinions of individuals have
>not been subject to peer review.

Agreed.  Only clear cases of data-entry error can be resolved by the database
staff alone.  Others require the help of the labs involved.  My point is not
that the errors can be resolved or how to resolve them, but that there are a
lot of them, and people tend to brush this under the rug.  We really need those
labs looking at the sequences they are responsible for and contacting the
databases to resolve these problems.  Until resolved, the problem should be
recorded in the database, so it does not trip up the next person.

>And of course GenBank/EMBL/DDBJ is:
>CAPABLE
>CITABLE
>COLI-FRIENDLY
>COLLABORATIVE
>COMPREHENSIVE
>COMPUTERIZED
>CROSS-REFERENCED
>CURRENT

Yep!  (Although it is not always parseable... :-)

>>With an exponential growth, this is only going to get worse.  PLEASE if you
>>know of errors REPORT THEM ON THE NET FROM NOW ON so we can all see how bad the
>>situation is.
>
>NO!!! Report them to the update addresses (there is of course an
>update at EMBL-Heidelberg.DE address too).

Actually I agree with this; the reporting mechanisms are just fine.  What I
wanted to know was "how bad the situation is".  A number of people subsequently
reported that they have also found lots of errors.  That's what I wanted to
know.  It's not just my dumb luck to get the bad ones!

>If there is simply
>an overlap between two entries then it can be commented on in each entry but
>there is no point merging two entries from different laboratories - how could
>you then report a correction from one laboratory to the sequence within
>the overlap???

Well, since the overlap starts out with identical sequences, you would just say
that laboratory x reports that at position y the sequence was found to be z.
If the other lab agrees, that would become the canonical sequence.  In a merged
database one would know where all the original sequences came from, so would
know that there was a conflict.  Either the Biologist's or the Historian's
views can be implemented.  They are interconvertible.  The Biologist's
viewpoint has the advantage of fast searches (you don't repeat the search on 20
copies, and you make the program allow for the known strain differences).  The
Historian's viewpoint has the advantage of simplicity in data storage.  We need
to support both.
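
The interconversion can be sketched in a few lines.  This is only an
illustration under simplifying assumptions: all entries cover the same region
and have the same length, differences are single-base substitutions, and the
first entry is arbitrarily taken as the canonical sequence.  The function
names and lab labels are hypothetical.

```python
def to_biologist_view(entries):
    """Historian -> Biologist: one canonical sequence plus per-source changes.

    `entries` maps a source label to its full sequence.  The first entry is
    used as the canonical sequence (an arbitrary choice for this sketch).
    """
    labels = list(entries)
    canonical = entries[labels[0]]
    variants = {}
    for label in labels:
        seq = entries[label]
        if len(seq) != len(canonical):
            raise ValueError("this sketch only handles equal-length entries")
        # Keep only the positions where this source disagrees.
        variants[label] = [
            (i, base) for i, base in enumerate(seq) if base != canonical[i]
        ]
    return canonical, variants


def to_historian_view(canonical, variants):
    """Biologist -> Historian: regenerate every source's full sequence."""
    entries = {}
    for label, changes in variants.items():
        seq = list(canonical)
        for i, base in changes:
            seq[i] = base
        entries[label] = "".join(seq)
    return entries


# Toy entries: "laboratory x reports that at position y the sequence is z".
historian = {"lab_x": "ACGTACGT", "lab_y": "ACGTTCGT"}
canonical, variants = to_biologist_view(historian)
print(canonical)  # ACGTACGT
print(variants)   # {'lab_x': [], 'lab_y': [(4, 'T')]}
assert to_historian_view(canonical, variants) == historian
```

Because the round trip loses nothing, choosing one storage form does not
rule out serving the other as a view.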

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms at ncifcrf.gov


