Genbank Errors

Peter Rice Peter.Rice at EMBL-HEIDELBERG.DE
Sat Oct 19 18:23:00 EST 1991


I have a (large) number of comments on Tom Schneider's (still unretracted)
posting.

First, let me point out that I am two floors away from the EMBL Data Library
staff and am writing as a database user with a special interest in E. coli
sequence entries. Perhaps someone from the EMBL Data Library staff will join
this discussion too.

OK, now for the detail. Surely inflammatory postings deserve flames...

1. The discussion appears to be an all-American flame war against GenBank.
   The data in GenBank is actually entered and maintained by three collaborating
   international database groups. The data is transferred between these groups
   to save enormously on the effort and manpower required. At least two of the
   entries referred to were originally entered by EMBL. Other entries may
   come from the DNA Data Bank of Japan.

2. Tom Schneider's original posting reported several overlaps between
   published sequences (note: between published sequences, each of which
   is represented as a separate entry in "GenBank"). The sequence databases
   are not, and cannot be, responsible for the quality of published sequence
   data, and could also argue that they have a duty to report faithfully what
   is in the original refereed publications.

3. When citing entries in the sequence databases, *please* refer to them by
   accession number, though you can use the name too. I realize that most
   biologists are happier to refer to the entry by a "meaningful name" but
   the names can and often do change from time to time. Suppose, for
   example, that a new naming standard changed the name of Tn5 or that
   it was agreed to call the gene "kanamycin resistance" rather than
   "neomycin resistance".

4. These Tn5 sequences have been reported at various times by several
   different laboratories. Even when sequenced in the same laboratory
   you cannot be certain that the sequences are from the same clone
   or even the same original Tn5 isolate. They belong in separate entries for
   many purposes. If you need a "merged entry" you should do it yourself
   and draw your own conclusions about any conflicts, or use a specialist
   database.

5. The overlaps *are* described - in entry X01702 (TRNTN5STR in the posting,
   but a different name in EMBL and DDBJ). The entry overlaps with
   V00617 (TRN5IR2) and V00618 (ECOTN5X).

6. The international collaboration began some years ago, but for a few years
   the three databases were separate and had differences in their definitions
   of some features. One of these was "CDS" which included the stop codon in
   one database's original entries but did not in another's. As entries were
   converted, the effort was put into reading as much of the feature table
   as possible, standardizing journal names, and so on. Any conflicts were
   often resolved by moving the annotation into comments instead of features.
   The databases now have a larger and more sophisticated feature table which
   allows them to distinguish between "CDS" as a coding region and for
   example "mat_peptide" as the region which codes for the final protein
   product. For *new* entries the use of the stop codon is now defined and
   agreed. Old entries stay as they were until a specialist
   curator can go through and correct them accurately. For example, I
   recently found a CDS that could not be translated due to an "extra"
   base. Checking with the original publication showed the sequence to
   be correct but a ribosomal frameshift was omitted from the feature
   table. At the time the entry was annotated, there was no way to easily
   represent such a process, and the concept was probably unfamiliar to the
   annotator in those days.

7. If you would like to know about overlaps between entries for E. coli
   sequences, then there are already two databases in existence that you
   can use. One is ECD from Manfred Kroeger, the other EcoSeq/EcoMap from
   Kenn Rudd. Both are described in the pull-out section of the October 11th
   issue of Science.

8. As one of the collaborators in the ECD database, I do not want to see
   merged entries in the database. Even the merges of the ara and lac
   operons are a hindrance rather than a help. The original ara work was
   not done in a K12 strain so many sequence conflicts reflect strain
   differences of great interest to the E. coli specialists. Even E. coli
   K12 strains have major differences (I classify having 20% of your
   chromosome inverted as major :-).

Now for a few cuttings out of the original posting...

>TRN5IR2 is mostly internal to TRN5IR1, with some base changes and then
>about 67 bases different on one end.  Yet they are both supposed  to be
>tn5.  I have not tracked down the source of this discrepancy.  I think it is
>outside the transposon.

Check the original papers!!! As you must surely see from the feature tables of
these two entries, the right inverted repeat of Tn5 is from 67 to the end.
The entries do not at all suggest that the remaining 66 bases should match.
After all, transposons do transpose, and could have been sitting almost
anywhere when they were caught. So of course the two entries differ.
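
To illustrate the point, here is a minimal Python sketch of how two entries can share the transposon end and still differ in their flanking bases. The function name and both sequences are made up for the example (they are not the real TRN5IR1 or TRN5IR2 data), and the standard-library difflib matcher merely stands in for a proper alignment program.

```python
from difflib import SequenceMatcher


def longest_shared_region(seq_a, seq_b):
    """Return (start_in_a, start_in_b, length) of the longest common block.

    Positions are 0-based. autojunk is switched off because the default
    heuristic can discard frequent characters in long inputs, which is
    unhelpful for a four-letter DNA alphabet.
    """
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    match = matcher.find_longest_match(0, len(seq_a), 0, len(seq_b))
    return match.a, match.b, match.size


# Toy data: the same 19-base "element" caught at two insertion sites.
shared_element = "ACGTTGCAAGGCTTACCGA"
entry_one = "AAAAAA" + shared_element    # flank from one insertion site
entry_two = "GGGGGGGG" + shared_element  # flank from a different site

print(longest_shared_region(entry_one, entry_two))
```

The shared block is found in both toy entries, while the bases in front of it have no reason to agree. That is the expected picture for a mobile element, not a discrepancy.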

>By joining these entries, the entire sequence of the tn5 transposon would be in
>the database, and anyone wanting it would just grab it.

But what is "the entire sequence of tn5" ??? How many copies of Tn5 are there
in E. coli, or in other enteric bacteria, or wherever? What differences are
there between them? What are the phenotypic effects of these differences?
What is the sequence specificity of the insertion sites (i.e. the various
sequences flanking Tn5)? All of these are important biological questions,
and if GenBank/EMBL/DDBJ is to be used then it *must* reflect these
differences.

Even if you were to put the sequences together yourself, and publish them
with a paper on the biological properties of Tn5, you may not have the
sequence correct. As an example, a paper this year reported a massive
sequence merge in E. coli around the two minute region, but one of
the joins was at a restriction site where there remains the possibility
that the two sites are not the same. Until someone sequences across the
join, the merge remains "probably correct, but not proven".

>I keep getting blubber from people who say that "Oh, we can handle that, we'll
>just have the entries separate and we'll provide you with a view of the data
>that is merged".
>
>Well, get on with it.  So far it's hot air.

No it is not. See the above comments on merged E. coli operon entries.
See also the E. coli databases. See also the GenBank postings on their
curator program, and the EMBL Data Library's Affiliated Data Units.

>Oh yes, in TRN5NEO, the end of the neomycin phospohtransferase gene is at 945.
>(which would be the A of the TGA)
>TRNTN5STR says that the end of the kananycin phosphotransferase
>(nb, the same gene as above, isn't it??)
>is at position 45, the A of TGA.  Thus the two are inconsistent.
>HOWCOME THIS WAS NOT DETECTED BY A PROGRAM??????

Fortunately the database entries are far more accurate than this example
of hasty typing. When you criticize publicly, please take care over
your spelling (phosphotransferase, kanamycin). Please also be careful with
the tone of your comments, otherwise you invite nit-picking such as "nb should
be N.B." or "a gene ends when a mutation does not cause a specific phenotypic
effect, a coding sequence ends at the end of the stop codon". The comments are
unclear, but I assumed in my reply above that the question is why the first
entry records 942 as the end rather than 945.
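
A difference of exactly three bases in the recorded end of a CDS is what the two stop-codon conventions would produce. The following is a minimal Python sketch of a check for this, assuming 1-based inclusive coordinates and the standard stop codons; the function name and the toy sequence are hypothetical and are not taken from any database entry.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_end_convention(sequence, start, end):
    """Classify how a 1-based, inclusive CDS range treats the stop codon.

    Returns "includes_stop" if the last codon inside the range is a stop
    codon, "excludes_stop" if the codon immediately after the range is a
    stop codon, and "unclear" otherwise (for example a partial CDS or a
    frameshift, which a specialist would need to check by hand).
    """
    seq = sequence.upper()
    if (end - start + 1) % 3 != 0:
        return "unclear"
    last_codon = seq[end - 3:end]   # final three bases of the range
    next_codon = seq[end:end + 3]   # three bases just past the range
    if last_codon in STOP_CODONS:
        return "includes_stop"
    if next_codon in STOP_CODONS:
        return "excludes_stop"
    return "unclear"


# Toy sequence: ATG AAA CCC TGA followed by two extra bases.
toy = "ATGAAACCCTGAGG"
print(cds_end_convention(toy, 1, 12))  # range ends on the A of TGA
print(cds_end_convention(toy, 1, 9))   # range stops just before TGA
```

A check like this can only flag which convention an old entry appears to follow. Anything it reports as "unclear" still has to be compared against the original publication.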

>Comparing TRN5NEO to the identically sequenced ECOTN5X, we see that some in one
>caase things are in features, the other, still in comments.  Inconsistent.

In this particular case you will note from the entries that one came from
GenBank and the other from EMBL in the days (a 1981 paper) before the close
collaboration was necessary. The EMBL entry clearly did not match the existing
GenBank one, so it was placed in a new entry. There were probably many reasons,
but just as a first guess look at the source lines. One was recorded as
E. coli and the other as transposon Tn5 so without detailed checking of the
literature and consultations between the databases it is far safer to create
the new entry. I know of many cases where over 20 closely similar sequences
have been published in a single paper, and all have to be separate entries.
The features may have been commented due to the "end-of-CDS" conflict, or as
general policy on such entry clashes.

>Also, the name of the gene is in a NOTE.  Put the names in something other than
>notes and comments so we can read them with programs!!  I've been saying this
>for 10 years and the GenBank staff has STILL not gotten it through their thick
>skulls.  No wonder they lost the contract.  Will you do any better David
>Lipman?

The gene names for E. coli, as it happens, are indeed being standardized. So
are the names for several other species. But where are the specialist curators
who will do this for the other species?

>Sorry.  This is an international disaster and nobody cares.

Ah, a realization that it is international anyway.

>GenBank is:
>
>  INCONSISTENT

The scientific literature is inconsistent. GenBank/EMBL/DDBJ may have some
historical oddities in the feature tables, but these should be checked by
a specialist and not simply converted according to a simple algorithm. To
err is human, but to really foul things up you use a program. Computers
have a similar line: to err is a feature of the algorithm, but to really
foul things up needs human input. In both cases, the results need to be
carefully checked.

>  REDUNDANT

Yes, and a good thing it is!!! If a sequence is determined ten times then it
should be in the database ten times.

>  FULL OF ERRORS

No!!! If it were full of errors it would be random sequence. Of the "errors"
that are there, sequencing errors are the responsibility of the original
laboratory (including entries that have M13 left in). Differences between
"overlapping" entries are interesting, and the opinions of individuals have
not been subject to peer review.

And of course GenBank/EMBL/DDBJ is:

CAPABLE
CITABLE
COLI-FRIENDLY
COLLABORATIVE
COMPREHENSIVE
COMPUTERIZED
CROSS-REFERENCED
CURRENT

and those are just the "C"s from my list of adjectives.

>With an exponential growth, this is only going to get worse.  PLEASE if you
>know of errors REPORT THEM ON THE NET FROM NOW ON so we can all see how bad the
>situation is.

NO!!! Report them to the update addresses (there is of course an
update at EMBL-Heidelberg.DE address too). *If* they are genuine errors by the
database staff they will be corrected. If they are suggestions that a sequence
is not correct, or if the entry annotation was provided by the authors, then
the problem will be passed on to the authors' laboratory. If there is simply
an overlap between two entries then it can be commented on in each entry but
there is no point merging two entries from different laboratories - how could
you then report a correction from one laboratory to the sequence within
the overlap???

I reported an overlap to GenBank (sorry EMBL, but both entries had GenBank
accession numbers so I went direct :-) which they passed on to the
authors as the sequences were from different laboratories. Details were on
the INFO-GCG list a few months back for those interested.

 -----------------------------------------------------------------------------
 Peter Rice, EMBL                             | Post: Computer Group
                                              |       European Molecular
 Internet:    Peter.Rice at EMBL-Heidelberg.DE   |            Biology Laboratory
 EARN/Bitnet: RICE at EMBL.bitnet                |       Postfach 10-2209
                                              |       W-6900 Heidelberg
 Phone:   +49-6221-387247                     |       Germany

*** Warning: my bitnet address expires at Christmas. Please use Internet ***


