GenBank Errors

Tom Schneider toms at
Sat Oct 26 18:33:23 EST 1991

In article <1991Oct20.185045.26318 at>
lamoran at (L.A. Moran) writes:

>1. There are three sequences of the yeast BiP gene in the database. The
>   sequences are almost identical but there are some differences between
>   the three labs. It seems desirable to merge these three records but
>   how does one handle the differences? Should we go with the consensus
>   in every case?

NEVER enter a consensus!  NEVER!  Establish the first sequence as the primary
one, then show the differences with annotation from then on.  Be sure that
Brian's code can extract all known strain sequences simply by giving the gene
name and the strain desired.  Voilà, clean solution.
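
A minimal sketch of that scheme, assuming differences are single-base
substitutions only (no insertions or deletions).  The record layout, the
function names, the lab labels and the toy sequences are all made up for
illustration; none of this is GenBank's actual format or Brian's code.

```python
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    """Hypothetical record: one primary sequence plus per-strain differences."""
    gene: str
    primary_strain: str
    primary_seq: str
    # strain name -> list of (0-based position, replacement base)
    variants: dict = field(default_factory=dict)

    def sequence_for(self, strain):
        """Rebuild the sequence reported for a strain from the primary one."""
        if strain == self.primary_strain:
            return self.primary_seq
        bases = list(self.primary_seq)
        for pos, base in self.variants[strain]:
            bases[pos] = base
        return "".join(bases)


def extract(db, gene, strain):
    """Look up a sequence by gene name and strain, as suggested above."""
    return db[gene].sequence_for(strain)


# Toy data, not real BiP sequence.
db = {
    "BiP": GeneRecord(
        gene="BiP",
        primary_strain="lab_A",
        primary_seq="ATGGCTTAA",
        variants={"lab_B": [(3, "A")], "lab_C": [(3, "A"), (5, "C")]},
    )
}

print(extract(db, "BiP", "lab_C"))
```

No consensus is ever stored: every reported sequence stays recoverable, and
the primary record is never overwritten.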

>2. There are several examples of the B. subtilis dnaK gene. Two of these
>   are complete sequences but they are very different. Who's correct?
>   How could we merge these two records? Should the differences be noted
>   in the two separate records, and if so who will do this?

You are right, this is a huge task.  The original labs should be responsible.
Unfortunately most molecular biology labs don't know about database issues like
this.  We have some education to do!  The two labs have to work out what the
differences are.  If they are strain differences, I suggest recording as
described above.

>3. The sequence of the human hsp70A1 gene is incorrect in the database. It
>   contains a large transposition of sequence that was also in the original
>   PNAS publication. When GenBank is informed of this by a third party should
>   they immediately alter the record or should they consult with the author
>   for confirmation? What if the author does not respond? The important
>   question here is who do you believe, and who has the right to alter
>   the database?

Consult with the author.  If the author does not respond within a reasonable
time (as stated in the original message, e.g., 6 months), proceed with the

The right to alter the database is a tough one.  I could easily make up
sequences which look like real ones, but are entirely artificial.  I would not
be too surprised to find some of these in the database already!  The ONLY way
we are going to detect these is to get more and more sophisticated statistical
analysis.  Which we can't do easily because the database is such a mess...
Normally, I would assume that the GenBank staff has to decide whether the
source seems to be legitimate.  That's not an easy job.

>4. One of the Leishmania hsp70 sequences is obviously full of errors because
>   there isn't even an open reading frame. Yet the sequence in the database
>   is the same as the one that was published and the publication claims that
>   it is a functional gene. Should GenBank insert a comment that states that
>   the sequence may be full of errors, over the objection of the authors?

GenBank could 1) note that the sequence is as published, verified by quadruple
(or however many) independent entries by 4 different people; 2) note that there
is no open reading frame; 3) note that the authors claim that there is an open
reading frame (or whatever with introns and exons); 4) note that there is an
inconsistency.  The last could be a special tagged word, so that, for example
(!) committees reviewing grants would be alerted to the potentially sloppy
work.  In other words, when people realize that their errors are being recorded
into the international database for all to see, and there is some potential
bite if they make a mess, they will clean up their acts.  GenBank only needs to
state the facts, the rest of us will understand.
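
As a rough illustration of notes 2) through 4), here is a sketch that scans
the three forward reading frames for an ATG-to-stop open reading frame and
attaches a tagged word when none is found but the authors report a functional
gene.  It ignores the reverse strand and introns, and the tag text, function
names and tiny threshold are assumptions chosen for the example; a real check
would need a far larger minimum length.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Longest forward-strand ORF, in codons, counting the ATG and the stop."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                best = max(best, (i + 3 - start) // 3)
                start = None
    return best


def annotate(seq, claimed_functional, min_codons=3):
    """Return factual notes only; the tag is a hypothetical keyword."""
    notes = []
    if longest_orf(seq) < min_codons:
        notes.append("no open reading frame of at least %d codons" % min_codons)
        if claimed_functional:
            notes.append("authors report a functional gene")
            notes.append("INCONSISTENCY")
    return notes


# Toy sequences, not real data.
print(annotate("ATGAAACCC", claimed_functional=True))
```

The notes state only what was observed and what was claimed, which is all the
database needs to record.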

>5. The database contains an early version of the E. coli dnaK gene that was
>   submitted by the authors. There is also a later more correct version of 
>   the same sequence that was published and also submitted by the authors.
>   Subsequently additional corrections to this sequence have been published
>   but not entered in the database. It seems clear that the early version
>   of this sequence should be deleted but who makes this decision? Can
>   GenBank be faulted for not upgrading the other sequence?

Yes.  This is directly hurting scientific research!  My initial inclination was
to delete the original sequence, but this is not reasonable for two reasons.
First, someone who goes back to the paper will be confused about why it does
not match the GenBank sequence.  Second, those who wish to reconstruct an analysis
performed with the original sequence would be thwarted.  Yet the clean sequence
should predominate in the database.  The original sequence could be easily made
available by the appropriate annotation.  The annotation should contain warning
flags that the original sequence had errors, and how they were corrected.
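
One way this could look, sketched under the assumption that the corrections
are substitutions between equal-length sequences: the corrected sequence is
the one stored, and each annotation entry keeps the position, the base as
originally published, and the corrected base.  The function names and the toy
sequences are hypothetical.

```python
def diff_substitutions(original, corrected):
    """List (0-based position, original base, corrected base) for each change."""
    if len(original) != len(corrected):
        raise ValueError("this sketch handles equal-length sequences only")
    return [
        (i, old, new)
        for i, (old, new) in enumerate(zip(original, corrected))
        if old != new
    ]


def restore_original(corrected, corrections):
    """Rebuild the sequence as first published from the stored annotation."""
    bases = list(corrected)
    for pos, old, new in corrections:
        if bases[pos] != new:
            raise ValueError("annotation does not match sequence at %d" % pos)
        bases[pos] = old
    return "".join(bases)


# Toy sequences, not real dnaK data.
published = "ATGCCGTAA"
corrected = "ATGCAGTAA"
corrections = diff_substitutions(published, corrected)
print(corrections)
print(restore_original(corrected, corrections))
```

Someone going back to the paper, or repeating an old analysis, can then ask
for the published version without it ever being the default.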

>6. My lab has resequenced a part of a gene that is already in the database
>   and we obtain a different sequence at a few positions. Obviously we
>   carefully check these positions and demonstrate that our sequence data
>   is correct but we cannot rule out cloning artifacts in our lab or in the
>   original lab. Should I be allowed to change the sequence in the database
>   on the grounds that the part which I resequenced is a better version?

A tough call.  This could be a strain difference, which we would not want to
lose.  I think that GenBank already records these as a "conflict", correct me
if that's wrong.  The only way to resolve it is for both labs to look over each
other's data, perhaps swap strains and do some more work.  It may not be worth
the effort, in which case the 'conflict' flag at least tells the next person
who is in a position to care and fix it what to do next.

>7. What if my lab sequences a human gene and discovers that there is a 
>   little bit of highly inaccurate sequence already in the database (from a
>   group that had done one sequencing run on a cDNA clone). Should the cDNA 
>   sequence be discarded and replaced by the complete genomic sequence?
>   (Would it make a difference if the inaccurate cDNA sequence was 
>   patented? (-:  )

HA!  I LOVE IT!  PATENT A USELESS SEQUENCE!!  Let the patent office figure
that one out.  This is the same problem as #6.

>There are several examples of errors that are probably GenBank's fault and
>in my experience these are quickly corrected when GenBank is alerted. The
>problems are with those errors that are NOT GenBank's fault. It is not
>obvious if, and how, such errors should be handled.

Go back to the original authors, and anybody else willing to do the
experimental work, and let them figure it out.  It's just like regular science,
it should be self correcting EXCEPT that we all have a common pool of data.

>My own feeling is that it would be desirable for experts to cull the database
>and make intelligent decisions about redundancy and errors.

Yes, but so far, the curator program has not been even a thousandth of the size it needs
to be.  Maybe there should be a special pot of money to pay people directly who
find errors in the database.  I bet you'd quickly see a cottage industry appear
which scouted for errors.  (Hey, I get to claim those duplications!!  That's
worth 30 bucks right there - $10 per duplicated entry!  And I have another
duplication I'll tell you about later that's worth, let's see... $50!  Because it
is so weird...)

>No data should be omitted but it could be relegated to annotation.  I doubt
>that there will be many "volunteers" to do this job.

Any lab involved in sequencing is responsible for ensuring that the sequences
they work with are correct, as far as they know, in the databases.  Make this a
requirement for funding of grants, and everybody will "volunteer"!  :-)

Or make that little industry boom...$$$...

>Incidentally, I believe that most sequences are no more than 99.4% accurate (i.e.
>6 errors per 1000 nucleotides) so we shouldn't get too upset about errors in
>the database.

But when detected, we should do our best to correct them.
Or are you suggesting that we wait for better technology?
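
To put the quoted 99.4% figure in perspective, a back-of-the-envelope sketch.
It assumes errors are independent and equally likely at every position, which
is a simplification not claimed in the thread; the function names are made up.

```python
def expected_errors(length, accuracy=0.994):
    """Expected number of wrong bases in a sequence of the given length."""
    return length * (1.0 - accuracy)


def prob_error_free(length, accuracy=0.994):
    """Chance that every base is right, if errors are independent."""
    return accuracy ** length


for n in (100, 1000, 3000):
    print(n, expected_errors(n), prob_error_free(n))
```

Under that assumption a 1000-nucleotide entry carries 6 expected errors and
is very unlikely to be error-free, so corrections are the normal case rather
than the exception.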

Nice gedanken questions.

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms at
