GenBank Errors

Tom Schneider toms at fcs260c2.ncifcrf.gov
Sat Oct 19 18:45:17 EST 1991


In article <Oct.16.17.35.11.1991.9987 at genbank.bio.net>
kristoff at genbank.bio.net (David Kristofferson) writes:

>... I politely suggested earlier that he issue a public retraction ...

My statement was based on partial information about the situation of the
contract, and was therefore indeed inappropriate.  Please accept my apologies.

>We appreciate a bit of civility as much as the next person.

I hope you have found my later statements civil.

>The bionet.general newsgroup is not the place to report
>GenBank errors.  As Paul Gilna at Los Alamos noted, GenBank has an
>address for this purpose, update at genome.lanl.gov.

Indeed.

>Also if you feel compelled to
>light a fire under our hind ends because you believe that your
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ yes!
>attempts to go through our regular channels have not been successful,
>I can't protest that too much either.  I would hope, however, that
>people would extend us the simple courtesy of trying normal channels
>***first***.

I was a GenBank advisor for many years with exactly the same goals as I have
been posting about.  The postings served two purposes:  to report the error and
to bring to wider discussion the difficulties of the international database.
(When I say "GenBank" I refer to the international efforts.)  Normal channels
have not worked, and the situation has worsened.  It should not have been so
trivial for me to find the errors I found in Tn3 and pBR322.  These problems
are limiting the science we can do with the data since they make massive
statistical analysis difficult.  The longer we wait, the worse it will get.
I've been suggesting solutions, such as named objects and merged entries, for
10 years.  Is that long enough?  Will I have to wait another 10 years before I
can reach into the database and find neatly organized data with minimal numbers
of errors?  Will we be so swamped with sequences that it won't happen for 50
years?

For some reason some people advocate the storage of the original literature
reports.  I have no objection to this in itself; however, if there is not a
merged 'view' of the data which is rigorously supported, then the database
fails from a biological point of view.  (100 years from now people will not
care who sequenced what in 1991.  By then I suppose we will sequence the entire
genome in a few seconds.)  To avoid duplication of effort we must merge or
appear to merge the data.  The effort to do this is the same either way.  The
software to handle 'views' is more difficult, so I fear that merges will not
take place for this reason.

Here's a practical suggestion.  Every time somebody on the GenBank staff opens
an entry for work, spawn off a search process which looks for matches with the
rest of the database.  This could be done as a spot check with 100 bases every
1000 bases.  If anything is found, report the duplication to the staff.  If you
worry about processing time, defer the search until late at night.  The
duplicate entries of Tn5, Tn3 and pBR322 would have been found by this method.
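
As a rough illustration of the spot check described above, here is a minimal
sketch in Python.  It is not an existing GenBank tool: the function names
(sample_probes, find_duplicates) and the in-memory dictionary standing in for
the database are assumptions made for the example.  It takes one probe of 100
bases every 1000 bases and looks for exact matches only, so it would miss
reverse-complement copies and copies that differ by a sequencing error inside
a probe.

```python
def sample_probes(sequence, probe_len=100, step=1000):
    """Yield (offset, probe) pairs: one probe_len window every step bases."""
    # An entry shorter than probe_len gives an empty range and no probes.
    for offset in range(0, len(sequence) - probe_len + 1, step):
        yield offset, sequence[offset:offset + probe_len]


def find_duplicates(entry_name, sequence, database, probe_len=100, step=1000):
    """Return (probe_offset, other_entry_name, match_position) for each hit.

    `database` is a hypothetical stand-in: a dict mapping entry names to
    sequence strings.  Only exact, same-strand matches are reported.
    """
    hits = []
    for offset, probe in sample_probes(sequence.upper(), probe_len, step):
        for other_name, other_seq in database.items():
            if other_name == entry_name:
                continue  # do not compare the entry against itself
            position = other_seq.upper().find(probe)
            if position != -1:
                hits.append((offset, other_name, position))
    return hits
```

A staff tool could call find_duplicates() whenever an entry is opened, queue
the work for an overnight run, and show any hits to the annotator as candidate
duplicate entries.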

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms at ncifcrf.gov


