Annotating Features (was Re: GenBank errors)

frist at frist at
Thu Oct 24 15:29:48 EST 1991

Recently, I posted an article listing reasons why I think it is essential
that each feature in the FEATURES table should include a unique label. 
Two more recent postings prompt me to expand upon this idea.

In article <2477 at> toms at (Tom Schneider) writes:

>One can put on different hats when one does things.  As a biologist, I want the
>complete known sequence of Tn5, so I can get on with designing a PCR primer for
>our research.  As a historian I would want the original papers.  As a computer
>scientist, I would be concerned that both views are possible.  The viewpoint of
>the biologist is not being supported by GenBank!  Who is the data base for
>anyway?  :-)  Clearly we need both views, and so it becomes a technical
>problem how this is implemented.  Two methods are possible:
>1. Biologist:  Physically merge the data and keep a careful record that allows
>one to reconstruct the original data.
>2. Historian:  Keep the data separate and make a careful record that allows one
>to construct the merged data.
>These both have consequences.  The Library of Medicine is interested in the
>second method, as would make sense for a library.  For retrospective analysis,
>as one person pointed out, one will want the original data.  The question is,
>which way would the data be used most frequently?  I claim that the Biologist's
>approach will far outweigh the Historian's, and that as time goes on, fewer
>people will care about the history, just as we rarely care to recall who
>discovered a gene or named a species.
>Technically, there is a difference.  If the Historian's approach is made, then
>to merge the entire E. coli genome will require quite a bit of computer time,
>each time it is wanted.  The alternative is to generate a duplicate, which
>means that if someone forgets to updated the duplicate, the database falls out
>of date or duplications have to be done continuously...  (Or worse, changes
>get made on the duplicate, and lost on the next automatic update!)
>With the Biologist's approach, most uses of the database are immediately
>available, with no computation.  To get an original sequence, you run (an
>updated version of) Delila.  I therefore advocate the Biologists solution.

I have to disagree here with Tom's conceptualization of Historian vs.
Biologist. I think it oversimplifies the issue.

If there were no such things as different alleles, strains, subspecies,
polymorphisms, multigene families etc. then the 'biologists' solution would
make sense. This solution makes it look as if there is a 'right' answer to
questions such as 'what is the sequence of the <insert your favorite gene>
gene', or 'what is the sequence of the <insert your favorite organism>
genome'.

I think we have to be very careful about merging entries, since the
component sequences may have been derived from different populations,
different strains etc. These differences are not simply of 'historical'
interest, but rather reflect the fact that
genes and genomes are not static but dynamic. As many examples of the same
gene get sequenced within a population, it may be very important to record
the variations that occur. 

Fortunately, the FEATURES language contains many of the operators and
qualifiers necessary to annotate these differences, IF ONLY THEY ARE
UTILIZED BY THE DATABASE ANNOTATORS! For example, an early implementation
of the features language annotated a strain variation in BRLTRPLE (M17892):

-                replace(207,"a")
-                replace(443,"c")
variation        group(r1,r2)

Using constructs as shown above, it is easy to reconstruct mutant_1041, if
you need to. However, recent policy changes have eliminated that
capability. Look at how this same data appears in Release 69.0:

variation        207
                 /note="g in AJ23036; a in mutant 1041"
variation        443
                 /note="a in AJ23036; c in mutant 1041"

These expressions are not software parseable, so effectively, data has been
lost.
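
To make the contrast concrete, here is a minimal Python sketch of how
software might evaluate the replace() operators in the earlier form. The
function name apply_replacements and the toy sequence are assumptions made
for illustration; they are not part of GenBank or of any existing program.
Positions are taken to be 1-based, as in the FEATURES table, and only the
single-base form replace(position,"base") is handled.

```python
import re

# Hypothetical pattern for the single-base form replace(position,"base").
REPLACE_RE = re.compile(r'replace\((\d+),"([a-z])"\)')


def apply_replacements(sequence, expressions):
    """Return a copy of sequence with each replace(pos,"x") applied.

    Positions are 1-based. Anything that is not a replace() expression
    raises ValueError, because it cannot be evaluated mechanically.
    """
    bases = list(sequence)
    for expr in expressions:
        match = REPLACE_RE.fullmatch(expr.strip())
        if match is None:
            raise ValueError(f"cannot parse location expression: {expr!r}")
        position, new_base = int(match.group(1)), match.group(2)
        if not 1 <= position <= len(bases):
            raise ValueError(f"position {position} is outside the sequence")
        bases[position - 1] = new_base
    return "".join(bases)


# Toy wild-type sequence, invented for this example.
wild_type = "acgtgacgta"
mutant = apply_replacements(wild_type, ['replace(5,"a")', 'replace(9,"c")'])
print(mutant)
```

Handing the same function a free-text qualifier such as
/note="g in AJ23036; a in mutant 1041" raises ValueError, which is the
point: the information is there for a human reader but not for a program.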

Here's another variation on how variations are implemented:

HS1ULA3 (M62932)
     CDS             1..393
     mutation        1..393
                     /phenotype="temperature sensitive mutant"
                     /note="Iso to Asp"

Here, the location assigned to the 'mutation' keyword evaluates to the
'wild type' sequence. The concept of a location being an expression that,
when evaluated, returns a sequence, has been lost. The /note= line is an
attempt to make a parseable expression, but how is the software supposed
to distinguish between the two /note= qualifiers? Furthermore, the
replace() expression does not even conform to the published syntax for the
FEATURES language. It should read 


As I have said in previous postings, the published specification for the
FEATURES language is a monumental breakthrough in making the databases
software accessible. Although the language has some problems, it was
obviously very carefully thought out, AND IT SHOULD BE USED! 

At the level of the gene, or for small genomes which can be sequenced in
their entirety, this method of annotation will suffice. However, for only
partially-sequenced genomes, or for the construction of 'composite'
genomes (eg. piecing together large eukaryotic genomes from thousands of
citations) this may not be a tenable solution.

In article <1991Oct21.171029.5798 at> chh9 at
(Conrad Halton Halling) writes:

>>LOCUS       BACCHROMO   17974 bp ds-DNA             BCT       18-OCT-1991
>>DEFINITION  Sequence of Bacillus subtilis chromosome.
>>ACCESSION   M80245
>>       Lines deleted...
>>REFERENCE   6  (bases 1 to 17974)
>>  AUTHORS   Henner,D.J.
>>  TITLE     Sequence of Bacillus subtilis chromosome
>>  JOURNAL   Unpublished (1991)
>>       Lines deleted...
>Does this mean that M80245 will one day become THE accession number for
>Bacillus subtilis?  In the meantime, the DEFINITION is misleading in
>the extreme. :-)

Conrad brings up another issue, which is: how do we begin to piece together
large entries, such as entire chromosomes?

While I think that Tom's 'biologists' solution above is fine for the
smaller merges, it may not be desirable to go that route where large
chunks of a genome are constructed from sequences obtained from many
different strains, subspecies, and locations. If we do so, we may be in
danger of presenting the user with a consensus model that could be

In this case, the 'historian's' solution may be the preferable one. While I
don't know if plans currently exist for handling this type of construct, it
is easy enough to create 'virtual entries' that fill this need. In fact,
the current method for handling segmented entries satisfies most of these
needs. For example, GenBank entry ASYPIGG6 (M38624) contains the 
3'-most exon for the green visual pigment protein. Since the introns were
not sequenced, each exon has been placed in a separate entry. The protein
coding sequence is recreated by evaluating the following expression:

     CDS             join(M38619:160..256,M38620:11..307,M38621:11..179,
                     /product="green visual pigment"

This is a pretty clean way to do things, with the exception that base
ranges (eg. 160..256) are used instead of labels. A comparable strategy
could be used to represent large genomes. Thus, to construct a chromosome
such as the B. subtilis chromosome cited above, virtual entries could
be defined which primarily refer to other entries:

contig              join(X30322:1..12998,M33248:324..910,
                    /map="0 to 2 minutes"

(other contigs defined as needed)

chromosome         join(contig1,poly("n",12000),contig2,poly("n",2500),
                   "gcggccgc",poly("n",8000),contig3 etc...)

While this hypothetical 'virtual entry' contains several extensions to the
FEATURES language, these extensions are consistent with the style of that 
language. These extensions are:

   1) in contig1, the accession number M23345 is used with no location. This
   notation simply means that the entire entry associated with that accession
   number is included.

   2) The feature operator poly() is defined as

        poly(literal,n)

   where the literal expression is repeated n times. This makes it possible
   to represent long regions that have not yet been sequenced, but for which
   it might be desirable to have a place holder in order to generate a
   'life sized' model of the chromosome. This notation also allows for
   the inclusion of such things as restriction sites which have been mapped,
   as exemplified by the NotI site flanked by two poly() operators.
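
A rough Python sketch of how such a virtual entry might be evaluated
follows. Everything in it is an assumption made for illustration: the
evaluate_join function, the in-memory ENTRIES table standing in for the
database, and the toy accession numbers (ACC1, ACC2) and sequences. It
handles only a flat join() whose terms are accession:from..to, a bare
accession, a quoted literal, or poly("x",n).

```python
import re

# Stand-in for the database: accession -> sequence (toy data, not real entries).
ENTRIES = {
    "ACC1": "aaaaacccccggggg",
    "ACC2": "ttttt",
}

RANGE_RE = re.compile(r'(\w+):(\d+)\.\.(\d+)')      # ACC1:6..10
POLY_RE = re.compile(r'poly\("([a-z]+)",(\d+)\)')   # poly("n",4)
LITERAL_RE = re.compile(r'"([a-z]+)"')              # "gcggccgc"


def split_terms(body):
    """Split a join() body on commas that are not inside parentheses."""
    terms, depth, current = [], 0, []
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            terms.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    terms.append("".join(current).strip())
    return terms


def evaluate_term(term, entries):
    """Return the sequence that a single term of a join() stands for."""
    if (m := RANGE_RE.fullmatch(term)):
        start, end = int(m.group(2)), int(m.group(3))
        return entries[m.group(1)][start - 1:end]    # 1-based, inclusive
    if (m := POLY_RE.fullmatch(term)):
        return m.group(1) * int(m.group(2))          # literal repeated n times
    if (m := LITERAL_RE.fullmatch(term)):
        return m.group(1)
    if term in entries:
        return entries[term]                         # bare accession: whole entry
    raise ValueError(f"cannot evaluate term: {term!r}")


def evaluate_join(expression, entries=ENTRIES):
    """Evaluate a flat join(...) expression into one sequence."""
    m = re.fullmatch(r'join\((.*)\)', expression.strip(), re.DOTALL)
    if m is None:
        raise ValueError(f"not a join() expression: {expression!r}")
    return "".join(evaluate_term(t, entries) for t in split_terms(m.group(1)))


chromosome = evaluate_join('join(ACC1:6..10,poly("n",4),"gcggccgc",ACC2)')
print(chromosome)
```

The virtual entry itself stays tiny; the sequence is only assembled when
someone asks for it. A real implementation would also need nested
expressions, complement(), and labels in place of base ranges.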

The way contig1 was defined, you are still dependent on absolute base
ranges to define a contig. It would be better if, upon creating a
virtual entry, each entry contributing to the virtual entry had a
feature defining that part of the entry used. For example, if the entry
of which M33248 was a part had a feature such as

misc_feature      324..910
                  /label=B_sub1

then, in place of M33248:324..910, the expression could be written as
'M33248:B_sub1'. That way, even if sequence was added to M33248, or it
was merged into a larger entry, the label 'B_sub1' would still point to
the same feature.
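
The benefit of labels can be sketched in a few lines of Python. The Entry
class, the resolve function, and the toy sequence below are hypothetical
(the sequence is invented and is not the real content of M33248). The point
is only that a reference such as 'M33248:B_sub1' is looked up through the
entry's own feature table, so it keeps returning the same bases after the
entry's coordinates shift, whereas an absolute base range goes stale.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    sequence: str
    # label -> (start, end), 1-based and inclusive
    labels: dict = field(default_factory=dict)

    def prepend(self, new_sequence):
        """Add sequence at the 5' end and shift every labeled feature."""
        shift = len(new_sequence)
        self.sequence = new_sequence + self.sequence
        self.labels = {
            name: (start + shift, end + shift)
            for name, (start, end) in self.labels.items()
        }


def resolve(reference, entries):
    """Evaluate 'accession:label' to the bases the labeled feature covers."""
    accession, label = reference.split(":", 1)
    entry = entries[accession]
    start, end = entry.labels[label]
    return entry.sequence[start - 1:end]


# Toy entry: the labeled feature covers bases 5..11.
entries = {"M33248": Entry("ttttgattacatttt", {"B_sub1": (5, 11)})}

before = resolve("M33248:B_sub1", entries)
entries["M33248"].prepend("ccccc")            # new sequence added upstream
after = resolve("M33248:B_sub1", entries)     # label still finds the feature
stale = entries["M33248"].sequence[4:11]      # old absolute range 5..11

print(before, after, stale)
```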

Since virtual entries are relatively small, it would be easy to
create as many alternative virtual entries as there are strains or
subspecies. In fact, there is no reason that you couldn't have a 'standard'
entry for some strain arbitrarily defined as 'wild type', and create secondary
virtual entries to describe variants, where these are needed. Such variants
need only contain enough instructions to modify the standard entry.

Don't tell me that constructs like those shown above are too difficult
for software to handle. With relatively minor changes, the GETOB program
of XYLEM could evaluate such expressions today!

>There is another consequence of the Historian's approach.  If the software is
>not done perfectly, inconsistencies could float around in the database.  For
>example, I might ask for a view of Tn5 and get one sequence, but because of a
>bug in the code, get a different one by asking a different way.  (Example:  two
>different merge programs might exist and not do the same thing.)  The
>Biologist's solution avoids this entirely by keeping the data in the form it is
>going to be used most often.  Notice that both approaches require a commitment
>to do the merges.  I fear that if the Historian's approach is taken, this
>commitment will be allowed to slide, and the Biologist will be unable to work
>efficiently.  This is indeed the situation today.

In the long run, I don't think that it will be possible to avoid using a
'view of the data', or virtual data, or whatever you want to call it. One
of the reasons I disagree with Tom's 'biologist' vs. 'historian'
distinction is that different disciplines of biology view the same thing in
very different ways, and in many cases, there is no 'right' view. The
Mendelian concept of a gene (something that gives you a detectable
phenotype) is not at all in conflict with the molecular definition (a piece
of nucleic acid that can confer a discrete phenotype, or however you want
to define it).  So it's not unreasonable that as databases become
interconnected into a grand Biomatrix, there will have to be many different
views of the data.

>To summarize, the issues of names and merging are tangled with the
>possibilities of using computer languages like Delila and its progeny such as
>DNA STAR (TM?, by Fred Blattner) to manipulate the sequence database.  To
>support these languages the database must have certain properties.  As the
>database grows, it will become critical to have access through languages
>because we won't be able to deal with the data any other way.


Brian Fristensky                | Spock: It is illogical to hunt a species to
Department of Plant Science     |        extinction. 
University of Manitoba          | 
Winnipeg, MB R3T 2N2  CANADA    | Marine biologist (flabbergasted): Uh, YES! 
frist at          | 
Office phone:   204-474-6085    | 
FAX:            204-275-5128    | Star Trek IV
