GenBank Errors

Tom Schneider toms at fcs260c2.ncifcrf.gov
Sat Oct 19 16:20:57 EST 1991


In article <9110162003.AA23997 at genbank.bio.net> CZJ at CU.NIH.GOV writes:
>> Folks:
>>
>> I have found that the following entries overlap:
>>
>> TRN5IR1 1438 1737   =  TRN5NEO   1 301
>> TRN5NEO 901  1300   =  TRNTN5STR 1 400
>> TRN5NEO             =  ECOTN5X   (exactly, reported in previous posting)
>>
>> In addition, the data are inconsistent:
>> TRN5IR1 is missing base 284 (c) in TRN5NEO.

>I do not usually get upset by e-mail messages, but the tone
>of Tom's recent e-mail message has no place on a public
>bboard.  The conclusions are wrong and border on libel.
>
>Jim Cassatt
>GenBank Project Officer

Jim:  I apologize for remarks which may have been taken as personal insults; my
remarks were not intended as such.  However, so far as I could investigate, the
remarks quoted above are accurate.  Also, I do not agree that these conclusions
are wrong:

>GenBank is:
>
>  REDUNDANT
>  INCONSISTENT
>  FULL OF ERRORS

I find examples of these every time I look into the database.

As a test, I used IRX to find entries of Tn10.  In the 14 entries which I
found, there exists one complete entry and several other entries which (by
their descriptions only) are likely to be duplications of the complete entry.
Thus (without having done a careful study as I did for Tn5) there is (probably)
REDUNDANCY in the database for this case also.  It is well known that
redundancy in a database leads to INCONSISTENCY because one place will be
corrected while the other is not.  I consider this an ERROR in database
structure.

Ok, on a blind shot, where I will write exactly what I find,
how about Tn3?  77 entries, mostly junction sequences.
One has the whole Tn3: TRN3.
Another has the repressor gene: TRN3RT.  (Is it the same as TRN3?  Not checked.)
Another has the whole thing again: ECOTN3X.
I find that the sequences for TRN3 and ECOTN3X are precisely identical!

The TRN3 entry has a title with the word DNA, but the same title in the other
entry is dna, a minor error.  The features are, of course, not recorded
identically.
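
For anyone who wants to repeat this kind of identity check, here is a minimal
sketch in Python.  It is not the tool I used; the function names and the two
sample fragments are made up for illustration.  It assumes that only the letters
of the sequence matter, so position numbers, spaces, line breaks and upper or
lower case are ignored.

```python
def normalize(raw: str) -> str:
    """Keep only the letters, lowercased, so layout and case do not matter."""
    return "".join(ch for ch in raw.lower() if ch.isalpha())


def same_sequence(raw_a: str, raw_b: str) -> bool:
    """True if two sequence texts hold the same bases after normalization."""
    return normalize(raw_a) == normalize(raw_b)


# Hypothetical fragments for illustration only, not taken from any entry.
entry_a = "        1 ggggtctgac gctcagtgga\n       21 acgaaaactc"
entry_b = "GGGGTCTGACGCTCAGTGGAACGAAAACTC"
print(same_sequence(entry_a, entry_b))
```

A check like this only says whether two sequences are identical; it says
nothing about where two differing sequences diverge.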

Here's an interesting one:
LOCUS       RSC13        7894 bp ds-DNA   Circular  SYN       03-MAY-1985
DEFINITION  Plasmid Rsc13 (derived from R1), complete genome, with a complete
            copy of Tn3.
I do not know if it has the same sequence as the other two.
But there are some conflicts recorded in TRN3 which are not
noted in ECOTN3X.  This redundancy could lead to uncorrected errors later.

Whoa!  In that search, I also found:
LOCUS       SYNR322      4363 bp ds-DNA             SYN       12-JUN-1991
DEFINITION  Plasmid pBR322 complete sequence.

SYNR322 is the same as PBR322!
LOCUS       PBR322       4361 bp ds-DNA   Circular  SYN       20-MAY-1991
DEFINITION  Plasmid pBR322, complete genome.

And what's this in SYNR322?
COMMENT     SWISS-PROT; P00810; BLAT$ECOLI. SWISS-PROT; P02981; TCR2$ECOLI.
            SWISS-PROT; P03051; ROP$ECOLI. SWISS-PROT; P03850; YPB2$ECOLI.
            SWISS-PROT; P03851; YPB3$ECOLI. SWISS-PROT; P03852; YPB1$ECOLI.
            SWISS-PROT; P03853; YPB4$ECOLI.
            The circular sequence is numbered such that 0 is the middle of the
            unique EcoRI site and the count increases first through the tet
            genes, the pMB1 material, and finally through the Tn3 region.
            From EMBL 27   entry PBR322;  dated 16-JUN-1990.

I picked it up because I searched for Tn3; Tn3 in PBR???  Really?
Perhaps SYNR322 is not labeled correctly?

Hmm.  They are not the same size.  The sequences differ.
Let's see what my merge program has to say...
file a: PBR322              
file b: SYNR322

i am 99% sure that this has a deletion in a (insertion in b) of 1 character:
file a: line 33
ttcatcggtatcattacccccatgaacagaaatcccccttacacggaggcatcagtgaccaaacaggaaaaaa
                                ><                    xxxxx x  xxx x
                                at 34 deletion, 10 mismatches downstream
file b: line 33
ttcatcggtatcattacccccatgaacagaaattcccccttacacggaggcatcaagtgaccaaacaggaaaaaa
                                >i<                    xxxxx x  xxx x
                                at 34 insertion, 10 mismatches downstream

The program could not figure out the second insertion because of the
mismatches.  These are the only differences between the two files.
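
The kind of test reported above can be sketched in a few lines of Python.
This is not my merge program, only a rough illustration under a simple
assumption: at the first point where two lines differ, try the three
explanations (substitution, one extra character in b, one extra character in
a) and keep the one that leaves the fewest mismatches in a short window
downstream.  The function names and the window size of 20 are arbitrary
choices made for this sketch.

```python
def first_difference(a: str, b: str) -> int:
    """Return the 0-based index of the first mismatch, or the shorter length."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))


def mismatches(a: str, b: str) -> int:
    """Count mismatching positions over the common length of a and b."""
    return sum(x != y for x, y in zip(a, b))


def classify_first_difference(a: str, b: str, window: int = 20):
    """Guess what happened at the first difference between a and b.

    Returns None if the strings are identical, else (index, kind), where kind
    is "substitution", "insertion in b" or "insertion in a".
    """
    i = first_difference(a, b)
    if i == len(a) == len(b):
        return None
    if i == min(len(a), len(b)):
        # One string is a prefix of the other: the extra tail is the insertion.
        return (i, "insertion in b" if len(b) > len(a) else "insertion in a")
    # Mismatches left in the downstream window under each explanation.
    candidates = {
        "substitution": mismatches(a[i + 1:i + 1 + window], b[i + 1:i + 1 + window]),
        "insertion in b": mismatches(a[i:i + window], b[i + 1:i + 1 + window]),
        "insertion in a": mismatches(a[i + 1:i + 1 + window], b[i:i + window]),
    }
    return (i, min(candidates, key=candidates.get))


# The two lines shown in the program output above.
file_a = "ttcatcggtatcattacccccatgaacagaaatcccccttacacggaggcatcagtgaccaaacaggaaaaaa"
file_b = "ttcatcggtatcattacccccatgaacagaaattcccccttacacggaggcatcaagtgaccaaacaggaaaaaa"
index, kind = classify_first_difference(file_a, file_b)
print(index + 1, kind)  # 1-based position and the chosen explanation
```

Like the program output above, this sketch looks only at the first difference;
finding the second insertion would mean realigning after the first one and
running the test again.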
It is not obvious from the entries why SYNR322 has these changes.

Finding errors in GenBank is so easy that I infer that the whole database is
riddled with errors.

GenBank was started to eliminate the redundant efforts of workers all around
the world to enter sequences.  We are in a similar situation today, but one
jump up.  GenBank now does a GREAT job of getting the sequences into the
database, as far as I have seen.  But raw sequences must be processed and
annotated to be useful to biologists.  In the long run this will take far more
effort than the actual sequencing, because GenBank and other databases will
come to represent huge amounts of our biological knowledge and because
sequencing will become easier.  The situation is reminiscent of the past
because anyone who wants to look at the sequence of Tn5 will have to struggle
as I did to find the overlaps, note the errors, and throw out the duplicates.
Of course, now that this case has been reported, it should not be a problem for
the next person.  However, I would think that people working on Tn5 or the
GenBank staff or a curator would have straightened it out sometime within the
past 6 YEARS, since the most recent publication on this topic (according to the
references in the entries) was in 1985.  Is the database already so large that
an important
biological element cannot be compactly and correctly stored in it?  If so,
then we are already in severe trouble.

Regards, Tom

>Dear Tom,
>
>I have read your note with interest. Since it would seem that you have
>given up on GenBank, I will not ask my staff to devote time to
>answering your message, but rather treat this as an unfortunately
>worded rhetorical question on your part. We will nonetheless look into
>your observations with respect to the specific errors you reported.

Paul:  I have certainly not given up on GenBank, or I would not have posted
anything.  I would suggest instead that it is those who are silent on the
matter of problems in the database who have given up.  I agree it is certainly
not worth your time to respond to my anger.  However, thank you for looking
into these particular errors, and also the ones listed above.

>We commend, and will of course honour your exhortation to your
>colleagues on the net to report errors to us; as always, we are happy
>to admit we are not perfect and welcome such reports at:
>update at genome.lanl.gov.

I will be doing this in the future, as I have in the past.

>My regards,
>
>--paul
>Paul Gilna
>GenBank, Los Alamos

Regards, Tom


