Database entry cross references

David Mathog mathog at seqvax.caltech.edu
Mon Oct 28 15:03:00 EST 1991


Here's my two cents' worth on database entries referencing
other database entries.  (Apologies if I missed any postings
presenting the following arguments, and for losing the
name of the author of the following quote.)  I agree with the
statement made earlier in this thread that:

 (c) It is better to refer to features from another entry by absolute
 coordinates (eg. X30405:11..238) rather than by label (eg.
 X30405:magA_protein)

The following two cases suggest why the first form should be
used.  (This argument only applies to cross references to
features that map to a region of sequence.  Different rules
may apply if cross references to unstructured text, such as
the comment field, are allowed.)

Case 1)  Entry 1 references a range in Entry 2 that, at the
time the latter was submitted, had no known significance or
was improperly delimited. In particular, when overlapping
sequences are submitted by different labs the first
submission will generally contain no references to the
second. In this case the <accession#>:range format is not
only appropriate, it is mandatory. The other format CANNOT
be used because no tag is defined. 

Case 2)  Entry 1 references an object that is correctly
defined in Entry 2 (with /note="magA_protein",
range=11..238).  In this case the <accession#>:label format
_appears_ to be superior, since it will pull out the range
and any other associated comments via the magA_protein tag.
However, the <accession#>:range format yields the same
information after a bit of work: search the features table
of Entry 2 for the feature whose range is 11..238.  Put
another way, the "objects" (used _loosely_) in the features
table can be retrieved by any of their values, even though
some values, such as "exon", will match multiple objects.
And yes, I do agree that it is easier to think in terms of
"magA_protein" than 11..238, but that isn't the question
here.
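
To make Case 2 concrete, here is a rough Python sketch of a
features table that can be searched by label, by range, or
by feature key.  Everything in it is hypothetical: the
Feature record, the in-memory FEATURES table, the "CDS" and
"exon" rows, and the function names are made up for
illustration and do not describe any database's real format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    key: str        # feature key, e.g. "CDS" or "exon"
    start: int      # first base of the feature (1-based, inclusive)
    end: int        # last base of the feature (inclusive)
    note: str = ""  # free-text label, e.g. "magA_protein"


# Hypothetical features table; the rows are illustrative, not real data.
FEATURES = {
    "X30405": [
        Feature("CDS", 11, 238, "magA_protein"),
        Feature("exon", 11, 120),
        Feature("exon", 180, 238),
    ]
}


def find_by_label(accession, label):
    """Resolve an <accession#>:label reference."""
    return [f for f in FEATURES.get(accession, []) if f.note == label]


def find_by_range(accession, start, end):
    """Resolve an <accession#>:range reference to any matching features."""
    return [
        f for f in FEATURES.get(accession, [])
        if (f.start, f.end) == (start, end)
    ]


def find_by_key(accession, key):
    """Look up features by key; a key such as "exon" may match several."""
    return [f for f in FEATURES.get(accession, []) if f.key == key]
```

In this sketch find_by_range("X30405", 11, 238) returns the
same record as find_by_label("X30405", "magA_protein"), while
find_by_key("X30405", "exon") returns two.  A range that
matches no feature (Case 1) simply returns an empty list, and
the range itself remains a usable reference.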

The big problem with either cross reference format is that
modifying Entry 2 can invalidate the cross reference in
Entry 1.  I suggest that this database maintenance problem
results more from the lack of a back reference in the target
entry than it does from the cross reference format.  That
is, if people are to be allowed to submit an Entry 1 that
references Entry 2, then they must also be allowed to submit
a modification to Entry 2 of the form "referenced by Entry
1".  Later modification of Entry 2 would entail checking the
noted cross references and patching _both_ entries as needed.
I don't see how we can avoid the use of doubly linked lists
(instead of the present singly linked lists) once
significant numbers of partially overlapping sequences start
pouring out of the genome projects.
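
The back reference idea can be sketched in Python as follows.
This is only an illustration under my own assumptions: the
CrossRefIndex class, its method names, and the accession
strings "ENTRY1" and "ENTRY3" are hypothetical, and the only
kind of modification handled is an insertion of bases into
the target entry.

```python
from collections import defaultdict


class CrossRefIndex:
    """Keeps forward and back references in step (hypothetical sketch)."""

    def __init__(self):
        # source accession -> {(target accession, start, end)}
        self.forward = defaultdict(set)
        # target accession -> {(source accession, start, end)}
        self.backward = defaultdict(set)

    def add(self, source, target, start, end):
        """Record that `source` references target:start..end, both ways."""
        self.forward[source].add((target, start, end))
        self.backward[target].add((source, start, end))

    def referenced_by(self, target):
        """List the entries that hold a reference into `target`."""
        return sorted({src for src, _, _ in self.backward.get(target, ())})

    def shift_target(self, target, position, offset):
        """Patch references after `offset` bases are inserted into `target`
        immediately before 1-based coordinate `position`.

        Returns the source entries whose references had to change.
        """
        def move(coord):
            return coord + offset if coord >= position else coord

        patched = set()
        updated = set()
        for source, start, end in self.backward.get(target, ()):
            new_start, new_end = move(start), move(end)
            updated.add((source, new_start, new_end))
            if (new_start, new_end) != (start, end):
                # Fix the forward link held by the referencing entry too.
                self.forward[source].discard((target, start, end))
                self.forward[source].add((target, new_start, new_end))
                patched.add(source)
        self.backward[target] = updated
        return sorted(patched)
```

For example, after add("ENTRY1", "X30405", 11, 238), a call
to shift_target("X30405", 10, 3) reports "ENTRY1" as patched
and moves its reference to 14..241.  Without the backward
table there would be no way to find ENTRY1 starting from
X30405 short of scanning every entry in the database.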

David Mathog
mathog at seqvax.caltech.edu


