GenBank Errors

L.A. Moran lamoran at gpu.utcs.utoronto.ca
Sun Oct 20 13:50:45 EST 1991


The GenBank database contains quite a few redundant records, and many of the
records contain errors. Furthermore, there are significant differences in
the quality of individual records: some are fully and informatively
annotated and others are not.

In my experience it is wrong to blame GenBank for most of these shortcomings
because it is usually the authors who have submitted a sequence with errors
or with inadequate information. I do not think that it is reasonable to
expect GenBank to police submissions or to expect a higher level of accuracy
than that expected by the journals. (Most of the errors that I have detected
are also in the published sequences.)

As a thought experiment let us assume that we need to "clean up" the database.
Here are some problems for you to consider. I believe that they illustrate
the difficulties involved and the kinds of decisions that are required in
order to improve the situation. What would YOU do?

1. There are three sequences of the yeast BiP gene in the database. The
   sequences are almost identical but there are some differences between
   the three labs. It seems desirable to merge these three records but
   how does one handle the differences? Should we go with the consensus
   in every case?

2. There are several examples of the B. subtilis dnaK gene. Two of these
   are complete sequences but they are very different. Who's correct?
   How could we merge these two records? Should the differences be noted
   in the two separate records, and if so who will do this?

3. The sequence of the human hsp70A1 gene is incorrect in the database. It
   contains a large transposition of sequence that was also in the original
   PNAS publication. When GenBank is informed of this by a third party should
   they immediately alter the record or should they consult with the author
   for confirmation? What if the author does not respond? The important
   questions here are who do you believe, and who has the right to alter
   the database?

4. One of the Leishmania hsp70 sequences is obviously full of errors because
   there isn't even an open reading frame. Yet the sequence in the database
   is the same as the one that was published and the publication claims that
   it is a functional gene. Should GenBank insert a comment that states that
   the sequence may be full of errors, over the objection of the authors?

5. The database contains an early version of the E. coli dnaK gene that was
   submitted by the authors. There is also a later more correct version of 
   the same sequence that was published and also submitted by the authors.
   Subsequently additional corrections to this sequence have been published
   but not entered in the database. It seems clear that the early version
   of this sequence should be deleted but who makes this decision? Can
   GenBank be faulted for not upgrading the other sequence?

6. My lab has resequenced a part of a gene that is already in the database
   and we obtain a different sequence at a few positions. Obviously we
   carefully check these positions and demonstrate that our sequence data
   is correct but we cannot rule out cloning artifacts in our lab or in the
   original lab. Should I be allowed to change the sequence in the database
   on the grounds that the part which I resequenced is a better version?

7. What if my lab sequences a human gene and discovers that there is a 
   little bit of highly inaccurate sequence already in the database (from a
   group that had done one sequencing run on a cDNA clone)? Should the cDNA
   sequence be discarded and replaced by the complete genomic sequence?
   (Would it make a difference if the inaccurate cDNA sequence was 
   patented? (-:  )
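
To make the "go with the consensus" option in problem 1 concrete, here is a
minimal sketch in Python. It is only an illustration, not anything GenBank
does: the function name merge_aligned, the lab labels, and the short
sequences are all hypothetical, and it assumes the records have already been
aligned to the same length. It takes a strict majority at each position and
keeps every lab's call at disputed positions so that nothing is discarded.

```python
from collections import Counter

def merge_aligned(records):
    """Merge pre-aligned sequences by strict majority vote.

    records: dict mapping a (hypothetical) lab label to its aligned sequence.
    Returns (consensus, conflicts), where conflicts lists every position at
    which the labs disagree, together with each lab's base at that position.
    """
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be pre-aligned to the same length")

    consensus = []
    conflicts = []
    for pos, column in enumerate(zip(*records.values()), start=1):
        top_base, top_count = Counter(column).most_common(1)[0]
        if top_count < len(column):
            # Disagreement: keep every lab's call as annotation.
            conflicts.append((pos, dict(zip(records, column))))
        if top_count * 2 <= len(column):
            # No strict majority, so the vote cannot decide this position.
            top_base = "N"
        consensus.append(top_base)
    return "".join(consensus), conflicts

# Made-up data standing in for three records of one gene.
three_labs = {"lab_A": "ATGGCT", "lab_B": "ATGGTT", "lab_C": "ATGACT"}
merged, disputed = merge_aligned(three_labs)
```

With three records a majority exists at every disputed position, but with
only two records (as in problem 2) each disagreement comes out as "N". The
vote cannot say who is correct; it can only record that the labs differ.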

There are several examples of errors that are probably GenBank's fault and
in my experience these are quickly corrected when GenBank is alerted. The
problems are with those errors that are NOT GenBank's fault. It is not
obvious if, and how, such errors should be handled.

My own feeling is that it would be desirable for experts to cull the database
and make intelligent decisions about redundancy and errors. No data should
be omitted but it could be relegated to annotation. I doubt that there
will be many "volunteers" to do this job. Incidentally, I believe that most
sequences are no more than 99.4% accurate (i.e. 6 errors per 1000 nucleotides)
so we shouldn't get too upset about errors in the database.
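
The arithmetic behind that figure can be sketched as follows. The 99.4%
accuracy is my estimate from above; the function names are hypothetical, and
the second function additionally assumes that errors occur independently at
each position, which is a simplification.

```python
def expected_errors(length_nt, accuracy=0.994):
    """Expected number of wrong bases in a sequence of length_nt nucleotides."""
    return length_nt * (1.0 - accuracy)

def prob_error_free(length_nt, accuracy=0.994):
    """Chance that a whole sequence is error-free.

    Assumption: errors are independent from one position to the next.
    """
    return accuracy ** length_nt

per_kb = expected_errors(1000)    # about 6 errors per 1000 nucleotides
clean_kb = prob_error_free(1000)  # well under 1% under the independence assumption
```

On this estimate a record of a few thousand nucleotides would be expected to
carry a dozen or more errors, so an entirely error-free record would be the
exception rather than the rule.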

Laurence A. Moran (Larry)
Dept. of Biochemistry
University of Toronto


