GenBank errors, accountability, reconciliation
toms at fcs260c2.ncifcrf.gov
Mon Oct 21 21:19:17 EST 1991
In article <1991Oct21.145134.21977 at magnus.acs.ohio-state.edu>
gchacko at magnus.acs.ohio-state.edu (George W Chacko) writes:
>In article <9110191820.AA02957 at primate.cshl.org> marr at CSHL.ORG (Thomas G. Marr) writes:
>>with his irresponsible, uninformed, and generally ludicrous bellowings
>>while he was an official GenBank advisor. His remarks, as noted by
>>Jim Cassatt, are erroneous, exagerated, and dangerously close to libel.
>>This is precisely the type of behavior that was exhibited by a few vocal,
>>highly visible, influential, yet remarkably incompetant individuals which
>>led to the current confusion and uncertain nature of the future of the
>..other remarks deleted..
>> Furthermore, considering the tarnishing nature
>>of his remarks on a widely-read, public electronic service, I suggest that he
>>be banned from further use of this service unless he has something
>>substantive or even interesting to say. I think a letter should be sent to the
>>Director of NIH by the GenBank staff describing this recent exchange, showing
>>her a good example of what happens when the peer review process is
>Considering that Thomas Marr takes such grave exception to the Tom Schneider's
>postings, I find it strange that he should resort to name calling in turn.
>Furthermore, I believe that suggesting that Tom Schneider, or anyone be banned
>from the Usenet is absurd.
WOW! I must not be getting some of these postings. I have not seen more than
these bits above. Other people have sent me other things. Our machine is
dropping postings, so I may not be able to respond to statements. In
particular, Dave's posting came some time after he sent it. The net is not
So I'll just respond to what I do see.
First, let me apologize if I offended you, Tom Marr. I was certainly NOT
thinking of you or indeed of anyone in particular when I wrote my first
postings. I was and am still extremely concerned that the status of GenBank
has not improved in certain directions. Thus we have the identical goal of a
great database. Perhaps if you thought that my statements while an advisor
were so off the mark, you should have had a conversation or 5 with me on the
side. My weak memory says to me that you didn't say anything, or now that I
think about it, that we might have said we would converse later. I recall that
the top level RDBMS definitions were good, but the lower level documentation I
saw at that time defining the relational database was difficult to get through
because everything was intertwined and there were no definitions written out in
English of all the terms or how the whole thing worked. I recall trying to
encourage you to create more documentation. Perhaps this has improved since
then. At least I TRIED to plow through it when I read it! I specifically
remember sitting in my living room in Boulder being puzzled just before the
meeting. In any case, the definition of the relational database is not what
I've been calling for. The Definition of GenBank would not mention the exact
implementation method. If I was uninformed, I should think that it was your
duty to inform and correct me. If I made ludicrous statements, then you should
have pointed out what was wrong with them. The words 'irresponsible',
'bellowings' and 'incompetant' are "bordering on libel", and the target is
quite precisely clear in this case. HOWEVER, I would much rather that we all
work together on this; I am not interested in fighting (although I will defend
myself). Let's get on with the task of finding out what is wrong with GenBank
and fixing it before it overwhelms us.
I think that all this means that you still don't understand what I am saying.
If you do, please make an accurate statement representing it. Certainly you
have enough material from my recent postings, and you can always go back to the
Delila documents (NAR 10:3013, 1982; the manual, delman; and the DEFINITION OF
THE DELILA DATABASE, libdef; the latter two are available by anonymous ftp from
ncifcrf.gov in pub/delila). If not, please say that.
>>are erroneous, exagerated
Other than my few unfortunate remarks, I believe that my statements were
accurate and up to date. I checked the latest GenBank entries just in case you
had made the correction. So they were accurate within minutes before the
postings. I cannot see how I exaggerated the intron boundary effect. The
programs I used operate correctly as far as I know. I do not exaggerate to say
that many entries in the database appear to have pieces cut off of them. If
you have an alternative explanation, it must be amazing because sequencing
techniques do not fade out exactly on multiples of 5! I did not exaggerate when
I discovered the duplications of Tn5, Tn3 or pBR322. I did not exaggerate when
I pointed out the inconsistencies between the two pBR322 entries. I have been
noting for several years now that when I pull entries from the database I have
found problems. I must not be the only person finding problems, and a number
of personal email letters to me now attest to that.
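
The multiples-of-5 argument can be checked with simple arithmetic. What follows is only a rough sketch, assuming the observation is about entry lengths; the function name and the lengths below are made up for illustration and are not real GenBank data. If entries ended wherever the sequencing happened to stop, roughly one length in five would be divisible by 5, so a much larger fraction hints that entries were trimmed.

```python
def fraction_on_multiple(lengths: list[int], k: int = 5) -> float:
    """Return the fraction of entry lengths that are exact multiples of k."""
    if not lengths:
        raise ValueError("need at least one length")
    hits = sum(1 for n in lengths if n % k == 0)
    return hits / len(lengths)


# Hypothetical entry lengths, invented for this example.
toy_lengths = [1200, 1535, 987, 2000, 450]
observed = fraction_on_multiple(toy_lengths)
expected = 1 / 5  # what chance alone would give for multiples of 5
print(f"observed {observed:.2f}, expected by chance about {expected:.2f}")
```

With real data one would want many entries before reading anything into the difference.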
As for erroneous, I even regularly check my spelling, and, for example, found
in this posting that the words you used are spelled 'exaggerate' and
'incompetent'. However, I don't claim to be perfect!
If you insist on suppressing people like me, then we are in a truly sad
state. Recall the recent world events, particularly what the first action was
in the Soviet Union when the coup began: suppress the press. Look in last
weekend's Parade magazine to read about it. Nuf said.
>> Furthermore, considering the tarnishing nature
You mean, I presume, that it tarnishes the image of GenBank as the perfect
repository of Genetic Data. Well, it's not. It's not even near that by a long
shot! If it were, I could write the following Delila I statements:
    organism E.coli; chromosome E.coli; gene lacZ;
    get from gene begin to gene end;
and I would obtain the entire lacZ gene, snipped out of the database. I can't
do this in general because not every gene and sequence object has a name!
(LacZ is named, but it only took one try to find out that LexA isn't!) WHY?
Because it was never DEFINED to have a name. It is, unfortunately, not a
requirement of the database that things have their genetic names. Doesn't it
seem reasonable that when a genetic name is defined it should be in the
database? COMMENTS DON'T COUNT: THEY CANNOT BE PARSED!!!
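
To make the point concrete, here is a minimal sketch in Python (not Delila) of what a name-based request needs from the database. The Feature record, the get_gene function, the sequence and the coordinates are all hypothetical, invented for illustration. The request only works when the gene name is a parsed field; a name buried in a comment is invisible to the program.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    kind: str             # e.g. "gene"
    begin: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    name: Optional[str]   # parsed genetic name, or None if it is only in a comment


def get_gene(sequence: str, features: list[Feature], name: str) -> str:
    """Return the bases from gene begin to gene end for the named gene."""
    for f in features:
        if f.kind == "gene" and f.name == name:
            return sequence[f.begin - 1 : f.end]
    raise KeyError(f"no gene named {name!r} among the parsed features")


# Toy entry: made-up sequence and coordinates, not real GenBank data.
sequence = "AAACCCGGGTTTACGT"
features = [
    Feature("gene", 4, 9, "lacZ"),   # named, so it can be requested
    Feature("gene", 10, 16, None),   # name never entered as a parsable field
]

print(get_gene(sequence, features, "lacZ"))  # bases 4 through 9 of the toy entry
```

Asking this toy entry for "lexA" raises KeyError, which is the situation described above: the sequence is there, but there is no name to hang the request on.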
Delila II is the mythical language I hope to write some day when GenBank gets
properly straightened out. It is to be an extension of Delila I. If the
database didn't contain duplicates I could write the following Delila II
statement:
    get from -100 to +100 around all ribosome_binding_sites on mRNA;
And Voila! I would have a file, in GenBank format, which contained only the
201 bases around the starts of every known ribosome binding site, with the
proviso that the fragments are on mRNA. (How do you suppose the numbering of
the fragments should be, 1 to n?) It took a very smart high school student
(now at MIT) 8 months to plow through the duplications in the human gene
Information theoretic analysis of important binding sites in GenBank would
become easy to trivial if the database contained the appropriate information.
(See JMB 188: 415-431, 1986; NAR 18: 6097, 1990) Until duplications are
removed, this approach is doomed.
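
The -100 to +100 request and the trouble with duplicates can be sketched as follows. This is Python, not Delila II (which does not exist yet), and the function name, the toy sequence and the site positions are assumptions made up for the example. Each window is 201 bases because the site position itself is counted, and two entries holding the same sequence yield the same fragment twice.

```python
def windows_around(sequence: str, sites: list[int],
                   before: int = 100, after: int = 100) -> list[str]:
    """Return the bases from site-before to site+after for each 1-based site."""
    fragments = []
    for site in sites:
        start = site - before
        stop = site + after
        if start < 1 or stop > len(sequence):
            continue  # window runs off the end of the entry, so skip it
        fragments.append(sequence[start - 1 : stop])
    return fragments


# Two hypothetical entries holding the same toy sequence (a duplicate).
toy_sequence = "ACGT" * 75          # 300 bases, invented for illustration
entries = {
    "entry_A": (toy_sequence, [150]),
    "entry_B": (toy_sequence, [150, 50]),   # site 50 is too close to the end
}

all_fragments = []
for seq, sites in entries.values():
    all_fragments.extend(windows_around(seq, sites))

print(len(all_fragments), "fragments,", len(set(all_fragments)), "distinct")
```

Any statistic computed over all the fragments counts the duplicated site twice, which is why the duplicates have to be removed before the analysis means anything.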
Notice that the Delila I statements are permanent in the sense that because the
names are genetic ones, they are reasonably stable. Those instructions are
likely to work 100 years from now! It works even if the sequence merges into
the whole E. coli genome, because the instruction is ABSOLUTE rather than
relative. The numbers of the objects in GenBank are unstable. Unfortunately
my student's efforts are being actively degraded because of these changes. We
can't just write down instructions and have most of them stay the same in the
future because the names are not consistently there to hang instructions off
Any instruction based on a GenBank Locus name is doomed within a short time.
Those names are unstable. Accession numbers are a fine solution, but do you
remember what they refer to?
>>substantive or even interesting to say.
I said these things 10 years ago. They have substance since they have a direct
implication for the quality of scientific work which can be done with the
database. Interesting? I can never get you to agree with that. But perhaps
you will find the theory of molecular machines (JTB 148(1):83-137, 1991)
interesting and substantial. I challenge you to read it and find a SINGLE
error in it other than the few I already know about (available on request). It
has very precise things to say about ribosome binding which can be critically
tested by proper use of the database. Since the database is a huge mess of
fragments having more to do with human sequencing efforts than the natural
structure of biology, such statistical tests are highly tedious.
How can we even THINK of doing a human genome project if we can't keep tiny
little E. coli straight?
National Cancer Institute
Laboratory of Mathematical Biology
Frederick, Maryland 21702-1201
toms at ncifcrf.gov