GenBank errors
Paul Gilna
pgil at HISTONE.LANL.GOV
Mon Oct 21 12:59:52 EST 1991
Let me try and place some (hopefully calming) perspective on
this business, which as Sanjay correctly says is about the
right issue, but on the wrong plane.
The issue is not that "errors", variously defined along the way
as ours, yours and everyone else's, exist in today's databases,
whether as a result of past travesties or present inadequacies
(heck, we knew that!); the issue is what we are going to DO
about it.
Let me try, as accurately and as unemotionally as I can, to
describe what our approach is to this problem.
Firstly, we have a job to do; acquire (some say extort!) the
data from you and return it to you in as rapid and as accurate
a manner as possible.
The rate of growth of these data is, as Tom rightly bemoaned,
exponential; GenBank's doubling time is creeping below the one
year mark, and there are times here where we put out in one
week what it took us one year to accomplish not too long ago.
That we are managing to do this is no accident--our philosophy
has been that in order to deal with a volume of data that doubles
every year, yet maintain roughly steady-state staffing levels,
we must double our efficiency every year just to keep pace.
The only other alternative is to allow our budget to follow the
same growth curve! But we long ago realised that throwing
people at the problem was not the answer.
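To make that arithmetic concrete, here is a minimal sketch. The
function name and the normalisation of staff to 1.0 are illustrative
assumptions, not GenBank figures; only the one-year doubling time and
the week-versus-year comparison come from the text above.

```python
import math

def required_throughput(years, doubling_time_years=1.0, staff=1.0):
    """Relative per-person throughput needed after `years`,
    assuming data volume doubles every `doubling_time_years`
    and staffing stays constant at `staff` (both normalised)."""
    volume = 2 ** (years / doubling_time_years)
    return volume / staff

# With flat staffing, the required efficiency doubles every year.
for year in range(6):
    print(year, required_throughput(year))

# Putting out in one week what once took a year is a 52-fold
# increase, i.e. log2(52) doublings of throughput.
doublings = math.log2(52)
print(round(doublings, 2))
```

Under these assumptions the week-for-a-year figure corresponds to
between five and six doublings.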
There are many paths we chose to achieve this, but there are
two I will mention because they have a bearing on this issue.
One was conferral of the responsibility for "data entry" to
the community; the other was automation of that transfer.
Most of you reading this know by now that we receive 80-90% of
our data in electronically submitted form. What you don't yet
know is that about 50% comes in automated, transaction form: in
the space of one year, the "market share" of Authorin
submissions has made dramatic leaps, such that about half of
the data we receive enter the database completely
automatically--leaving us with only the process of data
review.
These two achievements have relevance in the following manner:
In the past GenBank suffered because it believed it "owned" the
data, and therefore expended precious time "curating" those
data. Today, we must try to shift that onus to the author,
because ultimately THAT is the level at which error detection
and correction will have to happen. The simple fact remains
that, given the volume, GenBank cannot expect to "curate" the
entire database in order to keep it accurate. It is neither
money nor time that keeps us from this; rather, we believe that
it is the responsibility of the true "owner" of the data, the
submitting author. We cannot commit to running a project that
will spend a significant proportion of its resources correcting
data "after the fact".
Even having said that, the fact that so much of our data no
longer requires manual data entry means that, in the interim,
the same staff can apply themselves more efficiently to
validating the data instead of interpreting it. Our job, then,
is to provide that staff (and indirectly the community) with an
increasing battery of checks against the data.
However, with the advent of automated submission tools that are
now a reality we can begin to capitalise on this by providing
that same level of data integrity checking AT THE LAB BENCH.
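As an illustration only, here is the flavour of check that could run
either at review time or at the bench before a submission is sent.
This is a hypothetical sketch: the record layout, the field names and
the function check_record are assumptions made for the example and do
not describe Authorin or GenBank's actual checks.

```python
# IUPAC nucleotide codes (unambiguous bases plus ambiguity codes).
VALID_BASES = set("ACGTUNRYSWKMBDHV")

def check_record(record):
    """Return a list of problems found in a hypothetical submission
    record, given as a dict with 'sequence' and optional 'length'."""
    problems = []
    seq = record.get("sequence", "").upper()
    if not seq:
        problems.append("empty sequence")
    # Any character outside the IUPAC nucleotide alphabet is suspect.
    bad = sorted(set(seq) - VALID_BASES)
    if bad:
        problems.append("unexpected characters: " + "".join(bad))
    # If the author declared a length, it should match the sequence.
    declared = record.get("length")
    if declared is not None and declared != len(seq):
        problems.append(f"declared length {declared} != actual {len(seq)}")
    return problems

print(check_record({"sequence": "ACGT", "length": 4}))
print(check_record({"sequence": "ACXT", "length": 5}))
```

The point is that the same function can be handed to the submitting
author as easily as to a reviewer, so the error is caught where it is
cheapest to fix.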
We cannot hope to "clean up" the errors of the past unless we
can do so fairly automatically; without the power of a
relational database at our disposal we could not hope to even
begin to address these needed changes--we would still be
plugging away entry by entry at the flatfile level. As Michael
pointed out we have already targeted what we perceive to be the
most glaring errors and we are devoting both programming and
review effort to correcting them. In addition, the curator
program has already begun to bring in a number of scientists,
each of whose goal is to "fix" selected regions of the
database.
Conferring the responsibility for data entry to the community,
conferring the responsibility for "data release" to the
community (watch for this one!), conferring the responsibility
for data correction and updating to the community, the GenBank
Curator program, author response/review feedback loops, public
error communication channels, all point to one (albeit cliched)
philosophy on the part of GenBank:
In order to succeed in managing the problem of an
exponential growth of data, we must help the
community become part of the solution to, rather
than the source of, that problem.
Tom said "get on with it"; I say: "we are, and we're not finished yet"
--paul