Database entry cross references

Tom Schneider toms at fcs260c2.ncifcrf.gov
Wed Oct 30 20:42:06 EST 1991


In article <28OCT199112032429 at seqvax.caltech.edu>
mathog at seqvax.caltech.edu (David Mathog) writes:

>Here's my two cents worth on database entries referencing
>other database entries. (Apologies if I missed any postings 
>presenting the following arguments and for losing the 
>name of the author of the following quote).  I agree with the
>statement made earlier in this string that: 
>
> (c) It is better to refer to features from another entry by absolute
> coordinates (eg. X30405:11..238) rather than by label (eg.
> X30405:magA_protein)

The quote is from Brian Fristensky (frist at ccu.umanitoba.ca).  Both he and I
disagree with this quote.

>The following two cases suggest why the first form should be 
>used.  (This argument only applies to cross-references to
>features that map to a region of sequence.  Different rules
>may apply when cross references are allowed to unstructured
>text, such as the comment field). 

There is no way to make a stable cross reference into
a comment field, so that is not relevant.

>Case 1)  Entry 1 references a range in Entry 2 that at the
>time the latter was submitted had no known significance or
>was improperly delimited.

I doubt that a cross reference would ever be created to a sequence whose
significance was not known or which did not have SOME kind of labeled mark,
such as an RFLP or a unique primer sequence.  In the extreme case of random
sequences being dumped into the database, the label could be the accession
number, and this would FROM THEN ON refer to the first base of that sequence.
In fact, this is a nice way to put labels on merged sequences.  If the
reference is to a section of DNA 500 bases before the start of a gene, one
would make the cross reference as:

  organism E.coli;
  get from protein(magA) start - 1000
      to   protein(magA) start - 500;

I will call this a "pure-label" cross reference.  Notice that, unlike the two
forms given above, there are NO absolute coordinates.  No accession number is
necessary (unless the accession number is itself the label of a particular base).
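
As a rough illustration only, here is how a program might turn such a
pure-label reference into coordinates at lookup time.  This is a sketch in
Python, not Delila; the names Feature, PureLabelRef and resolve are
hypothetical, and it assumes each entry carries a table of labeled features on
the forward strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    organism: str
    label: str   # genetic label, e.g. "magA" (hypothetical data)
    start: int   # coordinate of the first base in the CURRENT entry
    end: int


@dataclass(frozen=True)
class PureLabelRef:
    organism: str
    label: str
    from_offset: int  # relative to the start of the labeled feature
    to_offset: int


def resolve(ref, features):
    """Return (from, to) coordinates for ref in the current feature table."""
    matches = [f for f in features
               if f.organism == ref.organism and f.label == ref.label]
    if len(matches) != 1:
        # A usable label must identify exactly one object.
        raise LookupError(f"{ref.label}: expected 1 match, found {len(matches)}")
    anchor = matches[0].start
    return (anchor + ref.from_offset, anchor + ref.to_offset)


# Assumed example data: magA starts at base 1500 of some entry.
features = [Feature("E.coli", "magA", 1500, 2200)]
ref = PureLabelRef("E.coli", "magA", -1000, -500)
```

The reference itself stores no numbers tied to any entry; the coordinates are
computed from whatever the feature table says at the moment of the lookup.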

If the cross reference is "improperly delimited", then this is an error in the
database and the proper range should be substituted.

>In particular, when overlapping
>sequences are submitted by different labs the first
>submission will generally contain no references to the
>second. In this case the <accession#>:range format is not
>only appropriate, it is mandatory. The other format CANNOT
>be used because no tag is defined. 

Hunh?  If the other sequence does not exist at all, then one can't make a cross
reference!  If you meant that the sequence exists but has no annotation, fine:
put the proper annotation in the second sequence at the same time that the
cross reference to it is made.

Case 1 does not seem to cause any trouble for pure-label cross references.

Notice that one could also eliminate that ugly ACCESSION reference by using the
species and strain numbers.  The result, as I showed in Delila-like form above,
is easy to remember and to understand.

>Case 2)  Entry 1 references an object that is correctly
>defined in Entry 2 (with /note="magA_protein",

Whoa there. :-)  Nothing inside a note can be used, because notes are free
format and a parser cannot figure out what to do with their contents.
Suppose, instead, that it was /gene="magA".

>range=11..238).  In this case the <accession#>:label format
>_appears_ to be superior since it will pull out the range
>and any other associated comments via the magA_protein tag. 
>However, the other format also yields the same
>information after a bit of work.

In 10 years all of this will have to be done by computer programs, and the
computers will be (conservatively) 10 times faster than today, so the small
amount of work involved in label lookups can be neglected.

>Put another way, the
>"objects" (used _loosely_) in the features table can be
>retrieved by any of their values, even though some values,
>such as "exon" will match multiple objects.

Referring to a multi-part object should be possible if that is what you want,
but in most cases (such as the construction of a virtual plasmid) a reference
that matches several objects would be a disaster.  This is a good reason for
giving every intron and exon not only its type (intron or exon) but also its
genetic name.
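
A small sketch of the difference, in Python with made-up data (the field names
and the exon/intron names below are hypothetical, not taken from any real
entry): a lookup by type alone returns several objects, while a lookup by type
plus genetic name returns exactly one or fails loudly.

```python
# Hypothetical feature table for one entry.
FEATURES = [
    {"kind": "exon",   "name": "magA_exon1",   "start": 11,  "end": 90},
    {"kind": "intron", "name": "magA_intron1", "start": 91,  "end": 149},
    {"kind": "exon",   "name": "magA_exon2",   "start": 150, "end": 238},
]


def find_by_kind(features, kind):
    """All features of a given type; may match many objects."""
    return [f for f in features if f["kind"] == kind]


def find_unique(features, kind, name):
    """The single feature with this type and genetic name."""
    matches = [f for f in features
               if f["kind"] == kind and f["name"] == name]
    if len(matches) != 1:
        raise LookupError(f"{kind} {name}: {len(matches)} matches")
    return matches[0]


exons = find_by_kind(FEATURES, "exon")                  # ambiguous: two exons
second = find_unique(FEATURES, "exon", "magA_exon2")    # unambiguous
```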

> And yes, I do
>agree that it is easier to think in terms of "magA_protein"
>than 11..238, but that isn't the question here. 

The issue also stems from the difference between absolute and relative
instructions, which I discussed earlier in my big Delila posting.  If you use
relative coordinates (i.e. distances from labeled sequence objects), then the
cross references are insensitive to merges.  Indeed, they thrive on merged data.
With absolute coordinates (i.e. numbers on a sequence) you have to change some
of the cross references whenever merges are made.
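
To make the merge argument concrete, here is a minimal Python sketch with
invented numbers.  It assumes a merge simply places 5000 bases in front of the
old entry, so every feature in it shifts by 5000; shift_features and
resolve_relative are hypothetical helper names.

```python
def shift_features(features, offset):
    """Feature table after a merge that adds `offset` bases upstream."""
    return {label: (start + offset, end + offset)
            for label, (start, end) in features.items()}


def resolve_relative(features, label, from_offset, to_offset):
    """Coordinates of a range given relative to the start of a label."""
    start, _end = features[label]
    return (start + from_offset, start + to_offset)


before = {"magA": (1500, 2200)}            # assumed coordinates before merging
after = shift_features(before, 5000)       # the same gene in the merged entry

absolute_ref = (500, 1000)                 # stored once, as bare numbers
relative_ref = ("magA", -1000, -500)       # stored as label plus offsets

old_range = resolve_relative(before, *relative_ref)
new_range = resolve_relative(after, *relative_ref)
```

Before the merge both references name the same bases.  After it, the relative
reference resolves to the shifted region, while the stored absolute numbers
still say 500..1000 and must be corrected by hand or by program.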

Thus neither of your examples favors absolute cross references.  Pure-label
cross references (i.e. species, strain, genetic label, range relative to label)
are better in the long run.

>The big problem with either cross reference format is that
>modifying entry 2 can invalidate the cross reference in
>entry 1.

Absolutely a good point.

> I suggest that this database maintenance problem
>results more from the lack of a back reference in the target
>entry than it does from the cross reference format.

No, the problem exists any time one has a cross reference.  It is possible to
automatically check for cross references TO a particular sequence, and to
correct them as the need arises.
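
A sketch of such a check, in Python and under assumed data structures (a flat
list of forward references, each recording its source entry, target entry and
target label; all names and entry identifiers here are hypothetical):

```python
# Hypothetical forward references collected from the whole database.
CROSS_REFS = [
    {"source": "entry1", "target": "entry2", "label": "magA"},
    {"source": "entry3", "target": "entry2", "label": "magB"},
    {"source": "entry4", "target": "entry5", "label": "lacZ"},
]


def references_to(cross_refs, target):
    """Every forward reference that points at `target`."""
    return [r for r in cross_refs if r["target"] == target]


def broken_references(cross_refs, target, labels_in_target):
    """References to `target` whose label no longer exists there."""
    return [r for r in references_to(cross_refs, target)
            if r["label"] not in labels_in_target]


# Suppose entry2 was modified and now carries only the label magA.
to_check = references_to(CROSS_REFS, "entry2")
to_fix = broken_references(CROSS_REFS, "entry2", {"magA"})
```

Only forward references are stored; the scan finds them when the target
changes, so the target entry itself needs no back-reference list.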

> That
>is, if people are to be allowed to submit an entry 1 that
>references entry 2, then they must also be allowed to submit
>a modification to entry 2 of the form "referenced by entry
>1".  Later modification of entry 2 would entail checking the
>noted cross references and patching _both_ as needed.

Unfortunately this would result in a huge number of back references in widely
used sequences such as lambda.  As I pointed out above, one only needs the
forward references as long as guard programs make sure that changes are
properly propagated.  Also, with a pure label system, there will only rarely be
changes to propagate.  For example, a base change within the range or a merge
to another sequence does not affect a pure label reference at all.  A deletion
or insertion (as from a sequence correction) would change the relative range.
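
A guard program could decide which changes need propagating with a test along
these lines.  This is only a Python sketch under my own simplifying
assumptions: positions are 1-based, an insertion at position p goes in front of
base p, and a reference is affected only when an insertion or deletion falls
within the stretch spanning the labeled base and the referenced range.  The
function name is hypothetical.

```python
def edit_affects_ref(kind, pos, length, anchor, range_start, range_end):
    """True if a sequence edit changes a pure-label reference.

    anchor is the labeled base; range_start..range_end is the referenced
    region, all in coordinates of the entry before the edit.
    """
    lo = min(anchor, range_start)
    hi = max(anchor, range_end)
    if kind == "substitution":
        # A base change moves nothing, so the relative range is untouched.
        return False
    if kind == "insertion":
        # Bases inserted in front of `pos` matter only strictly inside lo..hi;
        # elsewhere the label and the range shift together or not at all.
        return lo < pos <= hi
    if kind == "deletion":
        # Deleted bases pos..pos+length-1 matter if they overlap lo..hi.
        return pos <= hi and pos + length - 1 >= lo
    raise ValueError(f"unknown edit kind: {kind}")


# Assumed example: label at base 1500, referenced range 500..1000.
ANCHOR, START, END = 1500, 500, 1000
```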

> I don't see how we can avoid the use of doubly linked lists
>(instead of the present singly linked lists) once
>significant numbers of partially overlapping sequences start
>pouring out of the genome projects. 

Simple: merge them!  However, one interesting way to make a merge is to use
double cross references as you suggest.  The user, asking for a complete merged
sequence, would never need to see all the links.

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms at ncifcrf.gov


