Why all the excitement?

Tom Schneider toms at fcs260c2.ncifcrf.gov
Wed Oct 23 12:45:05 EST 1991


In article <9110221345.AA15016 at genbank.bio.net> T80SMS1 at NIU.bitnet
(Samuel M. Scheiner) writes:

>As a population biologist I have been rather bemused
>by all of excitement over the notion of inconsistencies
>in gene sequences.  Why should we expect two different
>attempts to sequence the "same" gene at two different
>times in two different locations to be identical?  What
>ever happened to mutation? 8-)  From my perspective it is
>the *variation* that is interesting.  Rather than
>masking it by declaring a single "consensus" sequence,
>these variants should be *emphasised*.

There are several kinds of apparent "inconsistency", and we must carefully
distinguish between them.  Consider the pBR322 case that I bumbled into.  Two
entries, SYNR322 and PBR322, both claim to be the complete sequence of
pBR322.  There are at least two possibilities (I have not taken the time to
figure out which applies):

1.  They are strain differences, as you point out is possible.  This is, of
course, very useful information and should be stored in the database.  In this
case, I would say that the sequences should be merged with annotation that
allows a program to generate either sequence, tagged with the strain names.  In
other words, an apparent "inconsistency" may be biologically important and
should be noted.  A "consensus" sequence should NEVER be stored in the
database.

2.  The difference is an error in sequence entry.  This should show up as a
difference between the database entry and the published sequence.  I know that
there were corrections to the pBR322 sequence; perhaps what happened is that
one entry was corrected and the other was not.  The incorrect one should be
corrected.
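
For possibility 1, here is a minimal sketch of what "merged with annotation
that allows a program to generate either sequence" could look like.  The names
(Variant, MergedEntry, sequence_for) and the toy bases are assumptions made up
for illustration; this is not the GenBank format or any existing tool, and it
assumes the annotated variants do not overlap.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Variant:
    position: int  # 0-based index into the reference sequence
    ref: str       # bases the reference has at that position
    alt: str       # bases the named strain has instead ("" means a deletion)
    strain: str    # strain (or entry) name this difference belongs to


@dataclass
class MergedEntry:
    reference: str
    variants: list = field(default_factory=list)

    def sequence_for(self, strain):
        """Rebuild the full sequence of one strain from the merged record."""
        seq = self.reference
        mine = [v for v in self.variants if v.strain == strain]
        # Apply right to left so that earlier positions stay valid after
        # an insertion or deletion changes the length.
        for v in sorted(mine, key=lambda v: v.position, reverse=True):
            end = v.position + len(v.ref)
            if seq[v.position:end] != v.ref:
                raise ValueError(f"reference mismatch at {v.position}")
            seq = seq[:v.position] + v.alt + seq[end:]
        return seq


# Toy data, not real pBR322 sequence.
entry = MergedEntry(
    reference="ACGTACGT",
    variants=[
        Variant(2, "G", "A", "strainB"),  # substitution
        Variant(5, "C", "", "strainB"),   # one-base deletion
    ],
)
print(entry.sequence_for("strainA"))  # no variants: the reference itself
print(entry.sequence_for("strainB"))
```

Nothing is averaged into a "consensus" here: each strain's sequence is
recoverable exactly, and the differences stay visible as data.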

Some people are worried about what to do in these cases.  In my experience,
with a little footwork (NB it CAN take hours to do this though!) one can figure
out what has happened, and know exactly what steps to take.  I think that
almost everyone would agree on those steps, and that there is therefore
no need to worry about data interpretation, only about mistakes by the person
who makes the corrections!

Also, this case displays one reason that I advocate merges or at least vigorous
pursuit of a merge viewpoint of the database.  Consider those fragments of Tn5
which overlap.  Consider the following IMAGINARY scenario.  Suppose someone
sequences a region (as we did recently) and discovers that their results are
not consistent with the GenBank data.  They contact the original authors about
it and those authors look back at their original data and see that, indeed,
there was an error.  Both parties send a note to GenBank about the correction.
The GenBank personnel pick up the relevant entry and correct it.  Imagine that
they do not realize that the same sequence exists on two other independent
entries.  The database becomes inconsistent.  A month later, someone else
merges the sequences together by hand (which could easily be the 6th
independent effort to do that!) and discovers that two are different.  They
find that one of the entries differs from the other two and from the
literature.  Not knowing better, they erase the correction carefully made by
the other two labs!  For this reason, it is clear that a historical record of
the correction has to be made, so that corrections don't get lost later on.
However, by merging the sequences from the start, this potential problem would
not come up.  The correction would have been made and recorded.  The next party
would have taken the sequence and used it directly for their work.
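
The "historical record of the correction" in this scenario can be sketched as
a change log kept alongside the sequence.  The class and field names below are
hypothetical, and the sketch handles only single-base substitutions to keep it
short.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Correction:
    position: int  # 0-based index of the corrected base
    old: str       # base before the correction
    new: str       # base after the correction
    source: str    # who reported it and on what evidence


@dataclass
class CuratedSequence:
    bases: str
    history: list = field(default_factory=list)

    def correct(self, position, new, source):
        """Replace one base and remember what was there before, and why."""
        old = self.bases[position]
        self.bases = self.bases[:position] + new + self.bases[position + 1:]
        self.history.append(Correction(position, old, new, source))

    def explain(self, position):
        """Return every recorded correction that touched this position."""
        return [c for c in self.history if c.position == position]


# Toy data: the two labs' correction is recorded, not just applied.
merged = CuratedSequence("ACGT")
merged.correct(2, "T", "resequenced; confirmed by original authors")

# A later curator who sees a mismatch with the literature at position 2
# can look it up before "fixing" it back.
for c in merged.explain(2):
    print(c.position, c.old, "->", c.new, "|", c.source)
```

With a single merged sequence there is only one place to record the
correction, which is the point of the scenario above.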

I just realized that, although we did not sequence far enough to check it
ourselves, the region of TRN5IR1 just beyond where we could read is missing a
base relative to the two other entries which overlap it, ECOTN5X and TRN5NEO.
So my story is pretty realistic!  Does anyone know which of the three entries
is correct?  I put my bet on the latter two, because (although I did not
check) deleting base 1666 of TRN5NEO would wreck the reading frame of the gene
it is in.
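
The reading-frame argument can be illustrated with a toy example.  The
sequence below is made up (it is not Tn5 data) and the helper names are
assumptions; the point is only that deleting a single base changes every codon
downstream of the deletion.

```python
def codons(seq, start=0):
    """Split seq into complete 3-base codons, beginning at index start."""
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]


def delete_base(seq, position):
    """Return seq with the base at the 0-based position removed."""
    return seq[:position] + seq[position + 1:]


gene = "ATGGCTGAAAAATGGTAA"  # made-up open reading frame ending in TAA

print(codons(gene))
# ['ATG', 'GCT', 'GAA', 'AAA', 'TGG', 'TAA']

print(codons(delete_base(gene, 4)))
# ['ATG', 'GTG', 'AAA', 'AAT', 'GGT']
```

After the deletion, only the codon upstream of the missing base is unchanged,
and the stop codon is no longer in frame.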

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms at ncifcrf.gov


