more on GenBank errors

Tom Schneider toms at fcs260c2.ncifcrf.gov
Sat Oct 26 19:05:09 EST 1991


In article <9110231652.AA14074 at primate.cshl.org> marr at CSHL.ORG (Thomas G. Marr) writes:
>I guess I should respond to some of the criticisms leveled against my
>original response to Tom Schneider's remarks about errors in the GenBank
>database...

Although your original posting never made it to my computer, I have seen bits
of it.  Let me first apologize specifically to you, Tom, for the consternation
my original postings may have caused you.

> I think what disturbed me most about his remarks was the fact
>that blanket assertions were made with no supporting data or statistics.
>For example, if he has encountered errors in the database, then we should
>have the details: 

I have not decided to devote my life to gathering statistics of errors in
GenBank.  It would be a huge and UNDOUBTEDLY both fascinating and frustrating
task.  I say this because in some previous postings others have suggested it
may be boring.  On the contrary, one can learn some biology as one goes.  But I
cannot devote energy in this direction except when it affects my work
directly.

Unfortunately, that is all the time.  We are looking at the binding sites from
a large number of systems now, and it is a struggle every time.  We find errors
every time we do something.  We have not kept statistics, but every time I go
into the database I find a problem or five.

An easy way to demonstrate this for yourself is to use IRX for any system
of genes you like in E. coli.  I know that Kenn Rudd and someone else (whose
name I don't know, sorry) are making merged databases of E. coli, but
unfortunately they are not (yet) in GenBank.  Within the E. coli entries,
one will easily find duplicate entries, overlaps, inconsistencies, etc.
If I find errors EVERY TIME I try, and I have tried 20 times in the last
few months, then I infer that most entries have errors.

I do not blame GenBank for errors made by others, but I do think that it is
GenBank's responsibility to maintain a consistent database, one in which
errors are not easy to find.

I DO NOT think it is worth doing a big statistical study of errors.  We all
just have to get down on our knees and scrub the floor!  Elbow grease!

>I have known Tom S. for many years and we have at least one thing in common
>and that is that much of our scientific livelihood depends on having GenBank
>and related resources available in a dependable way. Dependable in my book
>means complete, correct (within realistic reliability bounds), and available
>this year and beyond. We both get upset when it appears that someone or 
>something jepordizes [sic] these interests.

Agreed.  So, for example, I suddenly felt highly threatened when I realized
that we have made only partial progress toward named objects in the last 10
years:  at least (because I was "vocal" 10 years ago) we now have the ability
to store the genetic names, but not yet the political will to use it
consistently.  The database is therefore not yet reliable enough for me to
design and build Delila II on top of (for example) Brian's parser.  10 years
ago I thought that SURELY by now it would be possible, and that was why I was
willing to spend time as an advisor.  I hate to think that my time was wasted.

By the way, have you tried Delila yet?

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms at ncifcrf.gov


