GenBank Meta-Entries

Tue Apr 21 13:12:30 EST 1992

In response to the posting of Michael Ashburner (MA11 at
on 17 Apr 92 (Message: A58E856B48B28440 at UK.AC.CAMBRIDGE.PHOENIX):

We were happy to see Michael's note because he expresses a position closely
related to that adopted by the PIR-International Protein Sequence Database.
In the Protein Sequence Database, new accession numbers are assigned to each
sequence as reported, i.e., the unmerged ("atomic") form.  We differ slightly
from the model presented by Michael in that, rather than presenting the
unmerged reports and instructions for constructing the `meta-sequence',
`virtual sequence', etc., we present the merged sequence and instructions for
regenerating the originally reported sequences.  The unmerged forms are
preserved in an Archive, separate from the database; these data are not
currently distributed but will be made available in the future.

Both approaches are formally equivalent; the differences are practical, not
theoretical.  The most efficient approach is to store the form that is most
often used.  GenBank has chosen the atomic approach because they view their
primary role as that of a data repository for recording independently reported
sequences and they work with the data primarily as a set of independently
reported sequences.  We have chosen the merged approach because we believe
the Protein Sequence Database is best formulated as a scientific database
reflecting the current understanding of the information.  We work primarily
with the sequence in its merged form.  This approach has the additional
advantage of being less redundant.
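
The formal equivalence of the two approaches can be illustrated with a small
round trip.  The following is a minimal sketch only: the per-report difference
table is a hypothetical representation, not the syntax used in the Protein
Sequence Database, and the example assumes that all reports have the same
length, that they differ only by substitutions, and that the canonical residue
is chosen by simple majority.  The names merge_reports and regenerate and the
sample sequences are invented for the illustration.

```python
from collections import Counter


def merge_reports(reports):
    """Merge reported sequences into a canonical sequence plus differences.

    Assumptions (illustration only): every report has the same length,
    reports differ only by substitutions, and the canonical residue at each
    position is the most frequently reported one.
    """
    sequences = list(reports.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("this sketch handles equal-length reports only")

    canonical = "".join(
        Counter(seq[i] for seq in sequences).most_common(1)[0][0]
        for i in range(length)
    )
    # For each report, record only the positions (0-based) where it
    # disagrees with the canonical sequence.
    diffs = {
        name: {i: seq[i] for i in range(length) if seq[i] != canonical[i]}
        for name, seq in reports.items()
    }
    return canonical, diffs


def regenerate(canonical, diff):
    """Rebuild one originally reported sequence from the merged form."""
    residues = list(canonical)
    for position, residue in diff.items():
        residues[position] = residue
    return "".join(residues)


# Hypothetical reports of the same protein from three laboratories.
reports = {
    "report_A": "MKTAYIAK",
    "report_B": "MKTAYLAK",
    "report_C": "MKTAYIAK",
}
canonical, diffs = merge_reports(reports)
regenerated = {name: regenerate(canonical, d) for name, d in diffs.items()}
print(canonical, diffs, regenerated == reports)
```

Because every report can be regenerated exactly, storing the merged sequence
with its differences loses nothing relative to storing the unmerged reports;
the choice between the two is one of convenience and redundancy.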

A mechanism for updating the information must be developed irrespective of
which approach is adopted.  Either the `canonical' sequence (and the
instructions for decomposing the sequence if the canonical sequence changes)
must be updated, as in the Protein Sequence Database, or the instructions for
merging the sequence from individual reports must be updated.  As Michael
pointed out, neither approach can be successful without the aid of software
specifically designed for the task.
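
As a rough sketch of the update problem in the merged approach: when the
canonical sequence is revised, the instructions for every report must be
recomputed so that each report still regenerates exactly.  The code below
assumes the same hypothetical substitution-only difference table (a mapping
from 0-based position to reported residue); rebase_diff is an invented name,
and this is not the software or syntax used by the database.

```python
def regenerate(canonical, diff):
    """Rebuild a reported sequence from a canonical sequence and its diff."""
    residues = list(canonical)
    for position, residue in diff.items():
        residues[position] = residue
    return "".join(residues)


def rebase_diff(old_canonical, diff, new_canonical):
    """Re-express one report's differences against a revised canonical form.

    Assumption (illustration only): old and new canonical sequences have the
    same length, so only substitutions need to be tracked.
    """
    if len(old_canonical) != len(new_canonical):
        raise ValueError("this sketch handles equal-length sequences only")
    reported = regenerate(old_canonical, diff)
    return {
        i: residue
        for i, residue in enumerate(reported)
        if residue != new_canonical[i]
    }


old_canonical = "MKTAYIAK"
new_canonical = "MKTAYLAK"  # hypothetical revision at position 5
old_diffs = {"report_A": {}, "report_B": {5: "L"}}

new_diffs = {
    name: rebase_diff(old_canonical, d, new_canonical)
    for name, d in old_diffs.items()
}
print(new_diffs)
```

The reported sequences themselves are unchanged by the revision; only the
bookkeeping that relates them to the canonical form moves.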

We have designed a syntax for representing instructions for regenerating the
originally reported sequences.  This syntax has been employed in the database
for the past several years; please refer to the bulletin board message we
posted 19 Mar 92 (Message: 9203192259.AA24615 at on the
relationship between locus name and accession number.

The syntax explicitly lists the discrepancies between each reported sequence
and the canonical form.  This information is extremely valuable because it
identifies the regions of the sequence that are uncertain.  Conversely, the
residues at the positions where various laboratories
agree can be accepted with a much higher degree of certainty.  Presenting the
sequence data in merged form provides a direct mechanism for assessing the
reliability of the sequence data.
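
To make the reliability argument concrete, the sketch below counts, for each
position, how many reports agree with the canonical residue.  It is an
illustration under stated assumptions, not the database's own method: the
difference table is a hypothetical mapping from 0-based position to reported
residue, every report is assumed to span the whole canonical sequence (real
reports are often fragments), and the function names are invented.

```python
def position_support(canonical, diffs):
    """Number of reports agreeing with the canonical residue at each position."""
    total = len(diffs)
    return [
        total - sum(1 for diff in diffs.values() if i in diff)
        for i in range(len(canonical))
    ]


def uncertain_positions(diffs):
    """Positions at which at least one report disagrees with the canonical form."""
    return sorted({position for diff in diffs.values() for position in diff})


canonical = "MKTAYIAK"
diffs = {
    "report_A": {},
    "report_B": {5: "L"},  # one laboratory reported L instead of I
    "report_C": {},
}

support = position_support(canonical, diffs)
flagged = uncertain_positions(diffs)
print(support, flagged)
```

Positions supported by every report can be accepted with more confidence than
the flagged ones, and the merged form makes that distinction directly visible.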

We have developed software that reconstructs the originally reported sequences
from this syntax and the canonical sequence.  At present this software runs
only on VAX/VMS platforms, for which it is publicly available.  We hope to make
a platform-independent version of the code available before the end of summer.
We are also planning to develop a `workbench' for merging protein sequences and
will report on the progress of this project.  Currently, merging is done by the
database staff with tools that are not suitable for public distribution.

Michael very nicely points out an important role for external experts in the
development of biological databases.  Until about 1978, the Protein Sequence
Database project maintained a strong policy of external review.  It is
significant that this policy was discontinued not because it was ineffective,
but because coordinating the effort was too expensive.  We were unable to
obtain sufficient resources for continuing external review on the required
scale.  There are significant administrative and organizational problems that
must be addressed for such an approach to be effective on a large scale.  We
are formulating a plan to resume external review of the database by experts
from the scientific community.  This plan will include a high level of support
for these activities by the database staff.  While powerful software is
required, we believe that software alone is not sufficient.

                                 David George
                                 George at GUNBRF.bitnet
                                 Protein Identification Resource
                                 National Biomedical Research Foundation
                                 Georgetown University Medical Center
                                 Washington, DC  20007