more on GenBank errors

smouldering dog owhite at nmsu.edu
Wed Oct 23 23:31:36 EST 1991


In article <9110231652.AA14074 at primate.cshl.org> marr at CSHL.ORG (Thomas G. Marr) writes:

> I guess I should respond to some of the criticisms leveled against my
> original response to Tom Schneider's remarks about errors in the
> GenBank database...
>
> I think what disturbed me most about his remarks was the fact that
> blanket assertions were made with no supporting data or statistics.
> For example, if he has encountered errors in the database, then we
> should have the details:
>
>
> 1. How many errors have been encountered in the database? How many
>  entries have been examined? If there are, say, 30,000 entries in
>  the database and Tom S. has examined 1,000 entries and has found 30 
>  errors then this tells us something.

During some experimental work I have done, I discovered errors in the
CDS portions of GenBank features files.  These mistakes were incorrect
designations of exon-intron borders; the errors do not appear in the
original journal articles.  I suspect that they were introduced either
when the authors of these articles submitted the entries
electronically, or when the entries were typed in manually at GenBank.
I have notified genbank.updates about the plant genes.  The locus
names are listed below, with the number of errors in parentheses.

plants:
of 279 genes examined,
31 mistakes (11%) were found
RICRAC2(4)      RICRAC3(2)      RICRAC7(4)      MZEOPA2(2)
BLYGLUEND(1)    MZEOPA2(2)      CIPPPCA(1)      CIPPPCB(2)
PETRBCS08(2)    TOMCAB8(2)      TOMTRYINHI(4)   TRTHB(3)
PEAPHY(1)       PEALEGAG(1)


rodents:
of 202 genes examined,
6 mistakes (3.0%) were found
MUSIGULVJ(1)    MUSPSPC(2)      RATTRPM2B(3)
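
For anyone who wants to re-check the tallies above, here is a minimal
Python sketch.  The per-locus counts and the numbers of genes examined
are copied from the lists in this post; the names PLANT_ERRORS,
RODENT_ERRORS and summarize() are my own, made up for illustration.
I am assuming the percentages are total errors divided by genes
examined, since that is what reproduces the plant figure.

```python
# Per-locus error counts copied from the lists above, kept as
# (locus, count) pairs so that a locus listed twice is counted twice,
# exactly as it appears in the post.
PLANT_ERRORS = [
    ("RICRAC2", 4), ("RICRAC3", 2), ("RICRAC7", 4), ("MZEOPA2", 2),
    ("BLYGLUEND", 1), ("MZEOPA2", 2), ("CIPPPCA", 1), ("CIPPPCB", 2),
    ("PETRBCS08", 2), ("TOMCAB8", 2), ("TOMTRYINHI", 4), ("TRTHB", 3),
    ("PEAPHY", 1), ("PEALEGAG", 1),
]
RODENT_ERRORS = [("MUSIGULVJ", 1), ("MUSPSPC", 2), ("RATTRPM2B", 3)]


def summarize(errors, genes_examined):
    """Return (total errors, distinct loci, errors per 100 genes examined)."""
    total = sum(count for _, count in errors)
    distinct_loci = len({locus for locus, _ in errors})
    # Assumption: the rate is errors / genes examined, not the
    # fraction of genes that contain at least one error.
    rate = 100.0 * total / genes_examined
    return total, distinct_loci, rate


for label, errors, examined in (
    ("plants", PLANT_ERRORS, 279),
    ("rodents", RODENT_ERRORS, 202),
):
    total, loci, rate = summarize(errors, examined)
    print(f"{label}: {total} errors in {loci} loci, "
          f"{rate:.1f} per 100 genes examined")
```

Computed this way, the figures are errors per gene examined, not the
fraction of genes containing an error; the latter would be lower,
since several loci carry more than one error.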

> 2. What is the nature of each error? Is the error most likely
>  attributable to mistakes that the GenBank (or EMBL, DDBJ) staff has
>  made? Or is the error attributable to the original author? Is the
>  error attributable to software written by a secondary distributor of
>  GenBank? The point here is that there are many independent sources of
>  error and to be able to proceed with a plan to fix errors, this
>  detailed type of information is required. It does nothing to make wide
>  assertions which are prejudicial and not based upon the scientific
>  method. I for one have little room for prejudice whether it's in a
>  social context or a scientific context.

I am sorry this post doesn't answer the above questions in a more
rigorous way.  Certainly, some accessions were labeled "automatic".
The point I agree with is that there are many possible sources of
mistakes.  This is an emotional issue, and interestingly, all parties
concerned (at least appear to) want to see GenBank work.  I am of the
opinion that casting blame is less the issue than what can be done to
correct known errors in the database.
--

	owen white		(owhite at nmsu.edu)

-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-*-=-=-*-=-
		     there is no god, there is only noise
		     there is no noise, there is only god
-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-*-=-=-*-=-

the difference between art and science is that in art, if something
	works, it doesn't have to make sense.


