GenBank Errors
L.A. Moran
lamoran at gpu.utcs.utoronto.ca
Sun Oct 20 13:50:45 EST 1991
The GenBank database contains quite a few redundant records, and many of the
records contain errors. Furthermore, there are significant differences in
the quality of individual records: some are fully and informatively
annotated and others are not.
In my experience it is wrong to blame GenBank for most of these shortcomings
because it is usually the authors who have submitted a sequence with errors
or with inadequate information. I do not think that it is reasonable to
expect GenBank to police submissions or to expect a higher level of accuracy
than that expected by the journals. (Most of the errors that I have detected
are also in the published sequences.)
As a thought experiment let us assume that we need to "clean up" the database.
Here are some problems for you to consider. I believe that they illustrate
the difficulties involved and the kinds of decisions that are required in
order to improve the situation. What would YOU do?
1. There are three sequences of the yeast BiP gene in the database. The
sequences are almost identical but there are some differences between
the three labs. It seems desirable to merge these three records but
how does one handle the differences? Should we go with the consensus
in every case?
2. There are several examples of the B. subtilis dnaK gene. Two of these
are complete sequences but they are very different. Who's correct?
How could we merge these two records? Should the differences be noted
in the two separate records, and if so who will do this?
3. The sequence of the human hsp70A1 gene is incorrect in the database. It
contains a large transposition of sequence that was also in the original
PNAS publication. When GenBank is informed of this by a third party, should
they immediately alter the record or should they consult with the author
for confirmation? What if the author does not respond? The important
questions here are whom do you believe, and who has the right to alter
the database?
4. One of the Leishmania hsp70 sequences is obviously full of errors because
there isn't even an open reading frame. Yet the sequence in the database
is the same as the one that was published and the publication claims that
it is a functional gene. Should GenBank insert a comment that states that
the sequence may be full of errors, over the objection of the authors?
5. The database contains an early version of the E. coli dnaK gene that was
submitted by the authors. There is also a later more correct version of
the same sequence that was published and also submitted by the authors.
Subsequently additional corrections to this sequence have been published
but not entered in the database. It seems clear that the early version
of this sequence should be deleted but who makes this decision? Can
GenBank be faulted for not upgrading the other sequence?
6. My lab has resequenced a part of a gene that is already in the database
and we obtain a different sequence at a few positions. Obviously we
carefully check these positions and demonstrate that our sequence data
is correct but we cannot rule out cloning artifacts in our lab or in the
original lab. Should I be allowed to change the sequence in the database
on the grounds that the part which I resequenced is a better version?
7. What if my lab sequences a human gene and discovers that there is a
little bit of highly inaccurate sequence already in the database (from a
group that had done one sequencing run on a cDNA clone)? Should the cDNA
sequence be discarded and replaced by the complete genomic sequence?
(Would it make a difference if the inaccurate cDNA sequence was
patented? (-: )
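Problem 1 asks whether merged records should simply follow the consensus. As a rough illustration of what that would mean in practice, here is a minimal Python sketch of a majority-vote consensus. It assumes the records have already been aligned to the same length, and the function name and the toy sequences are made up for the example (they are not real BiP data). The sketch also reports the positions where the records disagree, since those are exactly the places where someone still has to decide whom to believe.

```python
from collections import Counter


def consensus(aligned):
    """Majority-vote consensus over pre-aligned, equal-length sequences.

    Returns the consensus string and the 1-based positions where the
    records disagree. A tie for the most common base is written as 'N'.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    out, disputed = [], []
    for pos, column in enumerate(zip(*aligned), start=1):
        counts = Counter(column).most_common()
        if len(counts) > 1:
            disputed.append(pos)
        # A tie for first place has no majority, so mark it ambiguous.
        tie = len(counts) > 1 and counts[0][1] == counts[1][1]
        out.append("N" if tie else counts[0][0])
    return "".join(out), disputed


# Made-up toy records standing in for three submissions of one gene.
records = ["ATGGCTAAA", "ATGGCTAAA", "ATGACTAAG"]
seq, disputed = consensus(records)
print(seq, disputed)
```

Note that with only two records, as in problem 2, every disagreement is a tie, so a vote settles nothing and the disputed positions can only be flagged.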
There are several examples of errors that are probably GenBank's fault and
in my experience these are quickly corrected when GenBank is alerted. The
problems are with those errors that are NOT GenBank's fault. It is not
obvious if, and how, such errors should be handled.
My own feeling is that it would be desirable for experts to cull the database
and make intelligent decisions about redundancy and errors. No data should
be omitted but it could be relegated to annotation. I doubt that there
will be many "volunteers" to do this job. Incidentally, I believe that most
sequences are no more than 99.4% accurate (i.e., 6 errors per 1000 nucleotides)
so we shouldn't get too upset about errors in the database.
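To put the 99.4% figure in perspective, here is a small Python sketch of the arithmetic. It assumes that errors occur independently at each position, which is a simplification added for the example and not a claim made above. The 2000-nucleotide length is likewise a hypothetical value.

```python
def expected_errors(length, accuracy=0.994):
    """Expected number of wrong bases in a sequence of the given length."""
    return length * (1.0 - accuracy)


def prob_error_free(length, accuracy=0.994):
    """Chance that every base is right, assuming independent errors."""
    return accuracy ** length


# Hypothetical 2000-nucleotide record at 99.4% per-base accuracy.
n = 2000
print(expected_errors(n))   # about 12 errors expected
print(prob_error_free(n))   # very small: an error-free record is unlikely
```

Under these assumptions a record of that size would be expected to carry roughly a dozen errors, so an entirely error-free entry would be the exception.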
Laurence A. Moran (Larry)
Dept. of Biochemistry
University of Toronto