more on GenBank errors
Thomas G. Marr
marr at CSHL.ORG
Wed Oct 23 11:52:21 EST 1991
I guess I should respond to some of the criticisms leveled against my
original response to Tom Schneider's remarks about errors in the GenBank database.
I think what disturbed me most about his remarks was the fact
that blanket assertions were made with no supporting data or statistics.
For example, if he has encountered errors in the database, then we should
have the details:
1. How many errors have been encountered in the database? How many entries
have been examined? If there are, say, 30,000 entries in the database
and Tom S. has examined 1,000 entries and has found 30 errors then this
tells us something.
2. What is the nature of each error? Is the error most likely attributable
to mistakes that the GenBank (or EMBL, DDBJ) staff has made? Or is the
error attributable to the original author? Is the error attributable to
software written by a secondary distributor of GenBank? The point here is that
there are many independent sources of error and to be able to proceed with
a plan to fix errors, this type of detailed information is required. It accomplishes
nothing to make sweeping assertions which are prejudicial and not
based upon the scientific method. I for one have little room for prejudice
whether it's in a social context or a scientific context.
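To show what the hypothetical figures in point 1 would tell us, here is a minimal Python sketch. It assumes the 1,000 examined entries are a random sample and that each entry is simply "in error" or "not in error". The function name wilson_interval and the choice of a Wilson score interval are illustrative choices of mine, not something anyone has proposed in this thread, and the numbers are the made-up ones from point 1, not real GenBank statistics.

```python
import math

def wilson_interval(errors, sampled, z=1.96):
    """Approximate 95% Wilson score interval for the per-entry error rate.

    Assumes the sampled entries are independent random draws (an assumption).
    """
    p_hat = errors / sampled
    denom = 1 + z**2 / sampled
    centre = (p_hat + z**2 / (2 * sampled)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / sampled + z**2 / (4 * sampled**2)
    ) / denom
    return centre - half, centre + half

# Hypothetical figures from point 1: 30 errors in 1,000 of 30,000 entries.
errors, sampled, total = 30, 1000, 30000
low, high = wilson_interval(errors, sampled)

print(f"estimated error rate: {errors / sampled:.3f}")
print(f"approx. 95% interval: {low:.4f} to {high:.4f}")
print(f"implied erroneous entries among {total}: "
      f"{low * total:.0f} to {high * total:.0f}")
```

The point is only that a count of errors together with the number of entries examined gives a rate with stated uncertainty, which a bare assertion does not.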
A simple experiment could be run as follows:
To begin to address the issue of errors we can start along the following lines -
Assume that the major source of errors is random mistakes, that there are
two possible outcomes from sampling a random entry - it is completely correct
or it is not - and that the probability of each outcome remains constant over the
time course of the experiment. Under these conditions, repeated independent
trials are Bernoulli trials and the Binomial distribution provides the
probability of x errors in a sample of size n. The Geometric distribution describes
the number of error-free entries we examine before we encounter the first error.
Using these density functions we can make different assumptions about the
tolerable level of error in the database and make procedural adjustments to
correct the errors. This should work given the above assumptions about the
type and nature of errors in the database. Given the size of the database,
these assumptions may work. However, other types of analyses could be
devised to accommodate different types of errors in the database, which may
have a bias associated with them, such as errors associated with a particular
computer program used for moving data through the GenBank, EMBL, DDBJ systems.
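The Binomial and Geometric calculations described above can be sketched in a few lines of Python. This is only an illustration under the stated assumptions (independent entries, constant error probability). The function names and the value p = 0.03 are hypothetical, chosen for the example, and are not measurements of GenBank. The Geometric form used here counts error-free entries before the first error, so its mean is (1 - p) / p.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x erroneous entries in a sample of n."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def geometric_pmf(k, p):
    """Probability that exactly k error-free entries precede the first error."""
    return (1 - p)**k * p

def expected_clean_run(p):
    """Mean number of error-free entries before the first error."""
    return (1 - p) / p

p = 0.03   # hypothetical per-entry error probability (an assumption)
n = 1000   # hypothetical sample size

# Chance of seeing 20 or fewer errors in the sample if p really is 0.03.
prob_at_most_20 = sum(binomial_pmf(x, n, p) for x in range(21))

print(f"P(exactly 30 errors in {n}) = {binomial_pmf(30, n, p):.4f}")
print(f"P(at most 20 errors in {n}) = {prob_at_most_20:.4f}")
print(f"P(first error on the very first entry) = {geometric_pmf(0, p):.4f}")
print(f"expected error-free entries before first error = "
      f"{expected_clean_run(p):.1f}")
```

Trying different values of p shows how a "tolerable" error level translates into what one should expect to see in a sample of a given size.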
I'm not saying that this is the answer, but I do think something along these
lines needs to be devised so that scientists can form informed opinions about
Why do I insist on non-prejudicial evaluations of GenBank and the like?
Although I do not know for sure what happened when the decision was made
to move the administration and scientific responsibility of GenBank from
NIGMS to NCBI, it is my impression that peer review was not the primary
mechanism for the change. Perhaps this is not even an issue. However, it
would seem that such decisions potentially have a major impact on a large
community of users and that the potentially affected community has a right
to know the consequences of such decisions. Will GenBank, imperfections and
all, remain essentially the same to the user community? If so, what plans
have been made to ensure this? Have these plans been evaluated by knowledgeable
individuals through the peer review process? And so forth...
I have known Tom S. for many years and we have at least one thing in common
and that is that much of our scientific livelihood depends on having GenBank
and related resources available in a dependable way. Dependable in my book
means complete, correct (within realistic reliability bounds), and available
this year and beyond. We both get upset when it appears that someone or
something jeopardizes these interests.