Errors in databases (was: GenBank Errors)

Larry Hunter hunter at
Mon Oct 21 16:16:58 EST 1991

I have two comments on the database errors thread.  I should point out
that I am speaking only for myself here, not for the National Library
of Medicine.

First, Bruce Roe (BROE at AARDVARK.UCS.UOKNOR.EDU) says:

  If I publish an article in a journal and later find out I made a
  mistake in an interpretation, should all the librarians world wide rip
  out those pages from the journal and put a note in place saying
  sorry but Bruce was wrong?  I'd rather see all our sins in the
  database unless the original author agrees to change their
  original data interpretation.

I agree, and would like to point out a problem with an important
database that doesn't follow this line: Brookhaven's PDB.  The
crystallographic database actually removes old structures that have
been superseded.

I discovered this trying to replicate an important application of
machine learning to protein structure prediction (Qian & Sejnowski,
JMB 1988 v202, pp 865-884).  The performance of learning programs
depends crucially on the examples they are trained with.  In order to
reproduce a result (e.g. to challenge an underlying assumption) the
same training examples must be used.  A large number of the training
examples used in the paper are no longer in the database.  The
missing structures have been superseded by corrected or improved
structures (e.g. at the time of the paper, the endothiapepsin acid
protease structure was accessioned as 2ape; currently the structure is
4ape).  Unfortunately, this improvement in the quality of the database
(I do not doubt the structure is improved) makes it very difficult to
replicate the machine learning result.  In fact, it appears that these
superseded PDB structures are not available from any public source.

There are other reasons to want to keep all the published results
(including superseded ones) in the database.  For example, it might be
interesting to do an analysis of the kinds of errors that occur in the
databases.  If erroneous entries are actually removed (rather than
marked as superseded) these studies are impossible.

I feel very strongly that, barring errors in data entry, once an item
is put in a national database, it ought to stay there.  Updates,
improvements, etc. can all be implemented so that it is clear that an
old entry is in error without removing it.  The only cost is a
relatively modest increase in the disk space required, and the benefits
are scientifically significant.
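
To make the "mark it, don't remove it" idea concrete, here is a minimal
sketch in Python.  It is only an illustration under my own assumptions,
not a description of how PDB or any other database is actually
implemented: the names Entry, Archive, deposit, get and current are
hypothetical, and the data strings are placeholders.  The accession
codes are borrowed from the example above purely as labels.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    accession: str                        # identifier, e.g. "2ape"
    data: str                             # placeholder for the deposited record
    superseded_by: Optional[str] = None   # set when a newer entry replaces this one


class Archive:
    """Hypothetical append-only store: entries are marked, never deleted."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def deposit(self, accession: str, data: str,
                supersedes: Optional[str] = None) -> None:
        """Add a new entry, optionally marking an older one as superseded."""
        if accession in self._entries:
            raise ValueError(f"{accession} already exists")
        if supersedes is not None:
            # Raises KeyError if the older entry is unknown.
            self._entries[supersedes].superseded_by = accession
        self._entries[accession] = Entry(accession, data)

    def get(self, accession: str) -> Entry:
        """Return exactly the entry that was deposited under this accession."""
        return self._entries[accession]

    def current(self, accession: str) -> Entry:
        """Follow superseded_by links to the most recent version."""
        entry = self._entries[accession]
        while entry.superseded_by is not None:
            entry = self._entries[entry.superseded_by]
        return entry


archive = Archive()
archive.deposit("2ape", "original coordinates (placeholder)")
archive.deposit("4ape", "revised coordinates (placeholder)", supersedes="2ape")

old = archive.get("2ape")         # the old entry is still retrievable
latest = archive.current("2ape")  # and it points forward to its replacement
print(old.superseded_by, latest.accession)
```

With a scheme along these lines, someone replicating an older study
asks for the entry by its original accession, while everyone else
follows the link to the current version.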

Second, I feel compelled to comment on Tom Marr's (marr at CSHL.ORG)
response to Tom Schneider's (toms at original
posting. Marr says:

  Furthermore, considering the tarnishing nature of his remarks on a
  widely-read, public electronic service, I suggest that he be banned
  from further use of this service unless he has something substantive
  or even interesting to say.

Regardless of how one feels about the content of Schneider's posting,
the idea of banning him strikes me as seriously wrong.  The purpose of
this forum is to *encourage scientific communication*.  So long as a
posting is germane and obeys the general Usenet posting policy (see
news.announce.newusers) I think it should be allowed on this
unmoderated newsgroup.  The appropriate response to a posting you
think is wrong (even seriously wrong) is to try to correct it via
either private or public discussion of the issue.  Banning speech by
those with unpopular views is clearly a mistake.

BTW, it is easy to configure almost any newsreader to ignore articles
on particular topics or by particular authors.  If you don't want to
read what someone has to say, add him/her to your "kill file" and let
the rest of us make our own decisions.  See the documentation of your
newsreader for details.


Lawrence Hunter, PhD.
National Library of Medicine
Bldg. 38A, MS-54
Bethesda, MD 20894
(301) 496-9300
(301) 496-0673 (fax)
hunter at (internet)

