Database entry cross references

frist at ccu.umanitoba.ca
Tue Oct 29 19:03:12 EST 1991


In article <28OCT199112032429 at seqvax.caltech.edu> mathog at seqvax.caltech.edu (David Mathog) writes:
>Here's my two cents worth on database entries referencing
>other database entries. (Apologies if I missed any postings 
>presenting the following arguments and for losing the 
>name of the author of the following quote).  I agree with the
>statement made earlier in this string that: 
>
> (c) It is better to refer to features from another entry by absolute
> coordinates (eg. X30405:11..238) rather than by label (eg.
> X30405:magA_protein)
>
>The following two cases suggest why the first form should be 
>used.  (This argument only applies to cross-references to
>features that map to a region of sequence.  Different rules
>may apply when cross references are allowed to unstructured
>text, such as the comment field). 
>
>Case 1)  Entry 1 references a range in Entry 2 that at the
>time the latter was submitted had no known significance or
>was improperly delimited. In particular, when overlapping
>sequences are submitted by different labs the first
>submission will generally contain no references to the
>second. In this case the <accession#>:range format is not
>only appropriate, it is mandatory. The other format CANNOT
>be used because no tag is defined. 
The easy solution here is that if no appropriate feature exists in the
target, Entry 2, then create one! It might look like this:

Entry 1 (X00532):

CDS                join(23..546,X76301:anyname)
                   /label=junk_protein

Entry 2 (X76301):   

misc_feature       446..789
                   /label=anyname
                   /usedin=X00532:junk_protein 

As long as the label is kept, Entry 1 will not have to be changed at
all, regardless of the coordinate changes made to Entry 2.
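
To make the idea concrete, here is a minimal Python sketch of how a program
might resolve a reference such as X76301:anyname. The dictionary layout and
the resolve() function are assumptions made up for this illustration; they
are not part of GETOB or of any database release format.

```python
# Hypothetical in-memory feature tables, keyed by accession number.
# Each feature carries a key, a location string, and its qualifiers.
ENTRIES = {
    "X76301": [
        {"key": "misc_feature", "location": "446..789",
         "qualifiers": {"label": "anyname", "usedin": "X00532:junk_protein"}},
    ],
}


def resolve(reference, entries):
    """Return the single feature named by an 'accession:label' reference."""
    accession, label = reference.split(":", 1)
    matches = [f for f in entries[accession]
               if f["qualifiers"].get("label") == label]
    if len(matches) != 1:
        raise LookupError(
            f"{reference}: expected 1 feature, found {len(matches)}")
    return matches[0]


print(resolve("X76301:anyname", ENTRIES)["location"])   # 446..789

# Entry 2 is revised and the feature moves. Only the target changes;
# the reference string stored in Entry 1 is untouched.
ENTRIES["X76301"][0]["location"] = "512..855"
print(resolve("X76301:anyname", ENTRIES)["location"])   # 512..855
```

The caller never stores 446..789 anywhere, so there is nothing in Entry 1
to go stale when the target is re-coordinated.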

>Case 2)  Entry 1 references an object that is correctly
>defined in Entry 2 (with /note="magA_protein",
>range=11..238).  In this case the <accession#>:label format
>_appears_ to be superior since it will pull out the range
>and any other associated comments via the magA_protein tag. 
Yes, and this is one of the great strengths of the label approach.
With an absolute range, you do NOT have any straightforward way of
bringing along the qualifier lines. Bringing them along is exactly what
GETOB does. Having the associated qualifier lines is vital, because
these can often tip you off to things like pseudogenes.

>However, the other format also yields the same
>information after a bit of work. Put another way, the
>"objects" (used _loosely_) in the features table can be
>retrieved by any of their values, even though some values,
>such as "exon" will match multiple objects.
As I pointed out previously, in the absence of a unique label, it is
harder to distinguish among multiple objects of the same type (e.g. CDS)
within an entry.
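
A small Python sketch of the difference, using an invented feature table
(the second CDS, its label magB_protein, and the /note text are made up for
this example). Selecting by feature key returns several candidates, while
selecting by label returns exactly one feature together with its qualifiers.

```python
# Invented feature table for a single entry with two CDS features.
FEATURES = [
    {"key": "CDS", "location": "11..238",
     "qualifiers": {"label": "magA_protein", "note": "possible pseudogene"}},
    {"key": "CDS", "location": "300..710",
     "qualifiers": {"label": "magB_protein"}},
]


def by_key(features, key):
    """All features of a given type, e.g. every CDS in the entry."""
    return [f for f in features if f["key"] == key]


def by_label(features, label):
    """All features carrying a given /label= qualifier."""
    return [f for f in features if f["qualifiers"].get("label") == label]


print(len(by_key(FEATURES, "CDS")))          # 2: ambiguous
hit = by_label(FEATURES, "magA_protein")
print(len(hit))                              # 1: unambiguous
print(hit[0]["location"], hit[0]["qualifiers"]["note"])
```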

>And yes, I do
>agree that it is easier to think in terms of "magA_protein"
>than 11..238, but that isn't the question here. 
>The big problem with either cross reference format is that
>modifying entry 2 can invalidate the cross reference in
>entry 1.
This will always happen with coordinates. While I agree in principle
that it is possible for this to happen with labels, I can't think of
a specific case in which modifying Entry 2 (without changing the labels)
would invalidate the reference in Entry 1. In any event, labels minimize
the effects that changes in a target entry have on an entry calling the
target.

> I suggest that this database maintenance problem
>results more from the lack of a back reference in the target
>entry than it does from the cross reference format.  That
>is, if people are to be allowed to submit an entry 1 that
>references entry 2, then they must also be allowed to submit
>a modification to entry 2 of the form "referenced by entry
>1".  Later modification of entry 2 would entail checking the
>noted cross references and patching _both_ as needed. I
>don't see how we can avoid the use of doubly linked lists
>(instead of the present singly linked lists) once
>significant numbers of partially overlapping sequences start
>pouring out of the genome projects. 

That's right, back references (eg. /usedin=) are unavoidable if we are
going to allow one feature to refer to another. In the long run, the use
of labels instead of absolute locations will be of immense value in
decreasing the rate of errors introduced as sequences are merged and
cross referenced. 
>
>David Mathog
>mathog at seqvax.caltech.edu
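
The doubly linked arrangement can be checked mechanically. The following
Python sketch assumes a hypothetical in-memory layout for the two example
entries, and assumes that accession numbers look like one letter followed
by five digits; it reports every label reference whose target lacks the
matching /usedin= back reference.

```python
import re

# Hypothetical feature tables for the two example entries.
ENTRIES = {
    "X00532": [
        {"key": "CDS", "location": "join(23..546,X76301:anyname)",
         "qualifiers": {"label": "junk_protein"}},
    ],
    "X76301": [
        {"key": "misc_feature", "location": "446..789",
         "qualifiers": {"label": "anyname", "usedin": "X00532:junk_protein"}},
    ],
}

# Matches accession:label, but not accession:11..238 (labels start with
# a letter or underscore in this sketch).
REMOTE_REF = re.compile(r"([A-Z]\d{5}):([A-Za-z_]\w*)")


def missing_back_references(entries):
    """List (source, target) pairs where the target has no /usedin=source."""
    problems = []
    for accession, features in entries.items():
        for feature in features:
            source = f"{accession}:{feature['qualifiers'].get('label')}"
            for acc, label in REMOTE_REF.findall(feature["location"]):
                targets = [t for t in entries.get(acc, [])
                           if t["qualifiers"].get("label") == label]
                if not any(t["qualifiers"].get("usedin") == source
                           for t in targets):
                    problems.append((source, f"{acc}:{label}"))
    return problems


print(missing_back_references(ENTRIES))   # []
```

A check of this kind could be run whenever a target entry is revised, so
that both ends of the link are patched together.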

In the absence of labels, a big problem arises at the user end. While
the coordinates in Entry 1 may be updated whenever changes are made in
Entry 2, changing coordinates will invalidate user-generated datasets
that used the old coordinates. For example, say you are maintaining a
personal database of mature peptide coding sequences for plastid
iron-sulfur ferredoxins.
  
The best way to do it is            Instead of like this:
like this:

M22345:frxA                         M22345:3002..3456
K33294:frx1                         K33294:128..529
X33245:ferfesu                      X33245:complement(12394..12031)
etc...

If you used absolute coordinates, then when one or more of these
sequences gets merged into an entry for the complete plastid genome for
that species, every entry using that feature would have to be changed as
well. With labels, changes in the target entries shouldn't affect entries
that use them. 
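
A toy Python illustration of why a stored coordinate goes stale while a
stored label does not. The sequences, the entry layout, and the extract()
function are all invented for this sketch; only the label frxA comes from
the example above.

```python
def extract(entry, reference):
    """Return the bases named by a label or by a 1-based 'start..end' range."""
    if ".." in reference:
        start, end = (int(n) for n in reference.split(".."))
    else:
        start, end = entry["features"][reference]
    return entry["sequence"][start - 1:end]


# Toy entry: the feature frxA covers bases 4..9 (shown in upper case).
old = {"sequence": "ccc" + "ATGAAA" + "ttt",
       "features": {"frxA": (4, 9)}}

# Revised entry: five bases were added upstream, so frxA is now at 9..14.
# The database updates the feature table; the user's notes are not updated.
new = {"sequence": "ggggg" + old["sequence"],
       "features": {"frxA": (9, 14)}}

print(extract(old, "4..9"), extract(old, "frxA"))   # ATGAAA ATGAAA
print(extract(new, "4..9"), extract(new, "frxA"))   # ggcccA ATGAAA
```

The user who recorded 4..9 silently retrieves the wrong region from the
revised entry; the user who recorded frxA still gets the coding sequence.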

In ANY programming task, it is better to have a few key variables upon
which other things depend. On our local system, we set one environment
variable called $db, which holds the name of the directory in which
GenBank, PIR, programs, and documentation are all stored. Thus, the
directory containing GenBank is $db/GenBank. The important thing here is
that even if $db has to change, the programs that access it do not. THE
MORE YOU HARDWIRE THINGS INTO THE CODE, THE MORE CHANCE THAT YOU'LL MISS
SOMETHING WHEN YOU'RE UPDATING IT. This is a fundamental principle of
good programming, and FEATURES is as much a programming language as
Pascal or C.
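
As a sketch of the same principle in Python: a program asks the
environment for db once and derives every path from it. The function name
database_dir() and the example directory /usr/local/biodata are
hypothetical; only the variable name db and the GenBank subdirectory come
from the description above.

```python
import os
from pathlib import Path


def database_dir(name, environ=os.environ):
    """Build the path to one database from the single 'db' setting."""
    root = Path(environ["db"])      # the one place the location is defined
    return root / name


# Passing a dictionary here stands in for the real environment.
print(database_dir("GenBank", {"db": "/usr/local/biodata"}))
```

If the data are moved, only the value of db changes; no program that calls
database_dir() has to be edited.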

I must correct a statement that I made in a previous posting, which was
that labels (or tags, if you prefer) need only be unique within an entry,
and not within the database. Under that rule, in the example given above,
it would not be confusing if the iron-sulfur ferredoxins from two
different entries were both labeled as 'frxA', for example, because they
are always referred to along with their accession number.

A problem arises, however, when you merge two entries, each of which
contains a feature with the same label. If both of the iron-sulfur
ferredoxins with the label frxA were parts of the same plastid genome,
and were later merged with other sequences to form the complete genome,
then both features would have the same name.

While it may not be the ideal solution, I can suggest one way out of this
dilemma. Perhaps someone else can do better. Anyway, one answer is to
incorporate the accession number into the label field, such as

/label=M22345:frxA

This has the advantage of keeping a mnemonic label, while at the same
time requiring no real change in the existing syntax. Furthermore, it's
guaranteed to be unique within the database.
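
A short Python sketch of the collision and of this way around it. The
qualify() and merge() functions are invented for illustration, and the
second frxA is assumed here to come from K33294 purely for the sake of the
example.

```python
def qualify(accession, features):
    """Prefix each label with its accession, e.g. frxA -> M22345:frxA."""
    return [dict(f, label=f"{accession}:{f['label']}") for f in features]


def merge(*feature_lists):
    """Combine feature lists, refusing to produce duplicate labels."""
    merged = [f for features in feature_lists for f in features]
    labels = [f["label"] for f in merged]
    duplicates = sorted({lab for lab in labels if labels.count(lab) > 1})
    if duplicates:
        raise ValueError(f"duplicate labels after merge: {duplicates}")
    return merged


a = [{"key": "CDS", "label": "frxA"}]   # from M22345
b = [{"key": "CDS", "label": "frxA"}]   # assumed to come from K33294

try:
    merge(a, b)
except ValueError as err:
    print(err)          # the bare labels collide

merged = merge(qualify("M22345", a), qualify("K33294", b))
print([f["label"] for f in merged])
```

Because accession numbers are unique, the qualified labels cannot collide
however many entries are merged.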

My final point is this. While there may be some problems inherent in using
labels in the database, they are solvable, and their advantages far
outweigh these problems.
   
===============================================================================
Brian Fristensky                | Spock: It is illogical to hunt a species to
Department of Plant Science     |        extinction. 
University of Manitoba          | 
Winnipeg, MB R3T 2N2  CANADA    | Marine biologist (flabbergasted): Uh, YES! 
frist at ccu.umanitoba.ca          | 
Office phone:   204-474-6085    | 
FAX:            204-275-5128    | Star Trek IV
===============================================================================


