Annotating Features (was Re: GenBank errors)

Tom Schneider toms at fcs260c2.ncifcrf.gov
Mon Oct 28 17:51:18 EST 1991


In article <1991Oct24.202948.8677 at ccu.umanitoba.ca> frist at ccu.umanitoba.ca writes:

>I have to disagree here with Tom's conceptualization of Historian vs.
>Biologist. I think it oversimplifies the issue.

I certainly have to agree with you, but would like to point out that
the distinction does help us focus on two extremes, namely

  biologist:  complete sequences merged together as in nature
  historian:  every paper is an entry, humanity intrudes in the database.

You are right that biologists would often want to look at all the strain
differences, and I agree that one solution is to define 'wild type' and
then to annotate all changes.  This will work for a while at least, and it
is better than duplication.  With the appropriate data structures in place,
we could have the fun of zooming along the genome by watching a sequence
logo of all the variants!  (sequence logos: NAR 18: 6097, 1990)
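
As a rough illustration of what such a logo would be computed from, here is a
minimal Python sketch that measures the information at each position of a set
of aligned variants.  It assumes DNA (a maximum of 2 bits per position) and
leaves out the small-sample correction, so it is only an approximation; the
function name and the four example sequences are made up for illustration.

```python
from collections import Counter
from math import log2

def information_per_position(aligned):
    """Return the bits of information at each column of equal-length DNA sequences."""
    n = len(aligned)
    bits = []
    for column in zip(*aligned):
        counts = Counter(column)
        # Shannon uncertainty of the bases observed in this column
        entropy = -sum((c / n) * log2(c / n) for c in counts.values())
        # 2 bits = log2(4) for the four bases; no small-sample correction applied
        bits.append(2.0 - entropy)
    return bits

# Made-up aligned variants: position 1 is invariant, position 2 varies.
variants = ["acgt", "acgt", "aagt", "atgt"]
bits = information_per_position(variants)
```

An invariant position comes out at the full 2 bits, and a position where the
variants disagree comes out lower, which is what sets the height of the stack
in a logo.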

>I think we have to be very careful about merging entries, since merging
>implies that the component sequences have been derived from different
>populations, different strains etc. These differences are
>not simply of 'historical' interest, but rather reflect  the fact that
>genes and genomes are not static but dynamic. As many examples of the same
>gene get sequenced within a population, it may be very important to record
>the variations that occur. 

I agree, and your technical solution appears to be correct.

>Look at how this same data appears in Release 69.0:
>
>variation        207
>                 /note="g in AJ23036; a in mutant 1041"
>variation        443
>                 /note="a in AJ23036; c in mutant 1041"
>
>These expressions are not software parseable, so effectively, data has been
>lost!

Oh my :-(  Would the person(s) who made this decision please speak up and
explain your actions?  I see absolutely no justification for degrading the
database like this.
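
To make the loss concrete, here is a small Python sketch of the same two
variations held as structured records rather than free-text notes.  The record
layout and the names Variation and differences are hypothetical, not anything
the databases define; only the positions, strains and bases come from the
Release 69.0 example quoted above.

```python
from dataclasses import dataclass

@dataclass
class Variation:
    position: int   # 1-based coordinate in the entry
    alleles: dict   # strain name -> base observed at that position

# The two quoted features as structured records (hypothetical layout).
VARIATIONS = [
    Variation(207, {"AJ23036": "g", "mutant 1041": "a"}),
    Variation(443, {"AJ23036": "a", "mutant 1041": "c"}),
]

def differences(variations, strain_a, strain_b):
    """Return (position, base_a, base_b) wherever the two strains differ."""
    found = []
    for v in variations:
        a = v.alleles.get(strain_a)
        b = v.alleles.get(strain_b)
        if a is not None and b is not None and a != b:
            found.append((v.position, a, b))
    return found

diffs = differences(VARIATIONS, "AJ23036", "mutant 1041")
```

With the data in this form a program can answer the question directly; with
the note form it would first have to guess at the wording of each note.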

>HS1ULA3 (M62932)
>     CDS             1..393
>                     /gene="UL33"
>                     /codon_start=1
>     mutation        1..393
>                     /phenotype="temperature sensitive mutant"
>                     /note="Iso to Asp"
>                     /note="replace(50,`a')"
>                     /gene="UL33"
>
>Here, the location assigned to the 'mutation' keyword evaluates to the
>'wild type' sequence.

Great, THAT will mess up programs.

> The concept of a location being an expression that,
>when evaluated, returns a sequence, has been lost.

Somebody (or several somebodies) evidently doesn't understand why the feature
table was invented.  Perhaps it is the original authors using authorin?  It
could be that authorin does not constrain the user sufficiently.  I would
suggest that the note be made very difficult to use, and that a warning appear
every time, telling the authorin user to try something else first.
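
The idea that a location is an expression which evaluates to a sequence can be
sketched in a few lines of Python.  This is a toy, assuming only two forms (a
span m..n and a one-base replacement); the function names span and
replace_base and the short sequence are invented for illustration and are not
the feature table's own definitions.

```python
def span(seq, start, end):
    """Evaluate the location 'start..end' (1-based, inclusive) to its sequence."""
    return seq[start - 1:end]

def replace_base(seq, pos, base):
    """Evaluate a one-base replacement at 1-based 'pos' to the variant sequence."""
    return seq[:pos - 1] + base + seq[pos:]

wild_type = "atgatcgca"                        # made-up sequence

as_annotated = span(wild_type, 1, 9)           # a plain span: gives the wild type back
as_intended = replace_base(wild_type, 5, "a")  # the operator: gives the mutant
```

A program that evaluates the plain span gets the wild type sequence and never
sees the change, which is the problem with burying the replacement in a note.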

Here is where a written definition of the philosophy of the database would help.
It would explain the reasons one should do things.

>As I have said in previous postings, the published specification for the
>FEATURES language is a monumental breakthrough in making the databases
>software accessible. Although the language has some problems, it was
>obviously very carefully thought out, AND IT SHOULD BE USED! 

Agreed!  Enormous amounts of information about sequences can be entered into
this form.  One nice thing is that if the features are sufficiently atomic, the
format can be changed later, or indeed, implemented as an RDBMS or whatever.
Front end software would have to be altered, but programs using the data would
be protected.  We'll never be able to make an AI program to figure out the
contents of notes, and that would be a silly thing to be forced to do anyway.
Notes don't contain machine usable information.

>DESCRIBING COMPOSITE GENOMES AND VIRTUAL FEATURES 

I like the idea of the composite genome.  You could even define species
as a collection of composite genomes!  This simple definition would be the
input to the sequence logo zoom program I mentioned above.

>While I think that Tom's 'biologists' solution above is fine for the
>smaller merges, it may not be desirable to go that route where large
>chunks of a genome are constructed from sequences obtained from many
>different strains, subspecies, and locations. If we do so, we may be in
>danger of presenting the user with a consensus model that could be
>misleading. 

Good point.  Perhaps the best way is to merge as we can, but to keep strain
information available.  Then one would be able to force the merged view to
contain only a subset of strains!
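
One way this could look, as a minimal Python sketch: keep one 'wild type'
sequence plus the changes recorded for each strain, and build the merged view
from whichever strains are asked for.  The layout, the function names and the
example strains A, B and C are all assumptions made up for illustration.

```python
def strain_sequence(wild_type, changes, strain):
    """Apply one strain's (position, base) changes to the wild type sequence."""
    seq = list(wild_type)
    for pos, base in changes.get(strain, []):
        seq[pos - 1] = base          # positions are 1-based
    return "".join(seq)

def merged_view(wild_type, changes, strains):
    """Map each position that varies among 'strains' to the set of bases seen there."""
    seqs = [strain_sequence(wild_type, changes, s) for s in strains]
    view = {}
    for i in range(len(wild_type)):
        bases = {s[i] for s in seqs}
        if len(bases) > 1:
            view[i + 1] = bases
    return view

wild_type = "acgtacgt"               # made-up sequence
changes = {                          # made-up strains and changes
    "A": [(2, "t")],
    "B": [(2, "t"), (6, "a")],
    "C": [],
}

all_strains = merged_view(wild_type, changes, ["A", "B", "C"])
only_a_and_b = merged_view(wild_type, changes, ["A", "B"])
```

Restricting the view to strains A and B makes position 2 drop out, since those
two strains agree there, while the full set of strains shows both positions.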

>In this case, the 'historian's' solution may be the preferable one.

A pure historian's solution, as currently planned, means the separation of
everything by publications.  Without some kind of merge, the names would not be
consistent, and this leads to a mess.  For example, when merging two entries,
the promoters have to be given their genetic names so that they can be
distinguished.  As we have it now, these objects do not have names, and so it
gets hard for someone to figure out which gene lies downstream of the promoters:
you have to keep comparing the numbers (which will change on the next merge).
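
A small Python sketch of why names help: if each feature carries a name, a
lookup by name gives the same answer before and after a merge shifts the
coordinates.  The feature layout, the names P1, P2, geneA and geneB, and the
coordinates are all made up, and the sketch ignores strand for simplicity.

```python
def shift(features, offset):
    """Return copies of the features with coordinates moved by 'offset', as a merge might."""
    return [dict(f, start=f["start"] + offset, end=f["end"] + offset) for f in features]

def downstream_gene(features, promoter_name):
    """Return the name of the nearest gene that starts after the named promoter."""
    promoter = next(f for f in features if f["name"] == promoter_name)
    genes = [f for f in features
             if f["kind"] == "gene" and f["start"] > promoter["end"]]
    return min(genes, key=lambda f: f["start"])["name"] if genes else None

features = [                          # hypothetical named features
    {"kind": "promoter", "name": "P1", "start": 10, "end": 40},
    {"kind": "gene", "name": "geneA", "start": 60, "end": 300},
    {"kind": "promoter", "name": "P2", "start": 320, "end": 350},
    {"kind": "gene", "name": "geneB", "start": 370, "end": 700},
]

before = downstream_gene(features, "P2")
after = downstream_gene(shift(features, 1000), "P2")
```

The numbers all change after the shift, but asking for the gene downstream of
P2 by name still works without comparing coordinates by hand.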

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms at ncifcrf.gov


