Beyond GenBank

Jim Ostell ostell at object.nlm.nih.gov
Tue Dec 22 08:38:01 EST 1992


    NCBI is aware of many of the issues raised in this 
discussion.  Rather than being "afraid to face these tough issues", 
we have been moving toward possible solutions internally at NCBI for 
the past two years, as many people who have offered to participate 
constructively in the process are aware.  In addition, you all must 
realize that NCBI has only been given authority over GenBank this 
October.  It will take more than a couple of months to make 
substantial progress.

  Further, GenBank is an existing international collaboration, so 
any changes need to be acceptable to the collaborative groups, or 
invisible to them.  Finally, there is a very large contingent of users 
who may be less cognizant of the problems with GenBank today, and who 
are very invested in keeping what is familiar unchanged.  Their needs 
must be addressed by any strategy taken by NCBI as well.

    Rather than quoting individual comments, now that a great deal of 
discussion has transpired, I would like to focus on issues, comment on 
what we see the problems to be, then present our plan for addressing the 
issue.

Requirements for Creating New Databases:

    A central theme in the NCBI strategic plan is support for databases 
built by outside domain experts, yet incorporated into a unified user 
view in the central sequence databases.  In order to accomplish this 
there must be:

    1) a standard computer readable data exchange language.  It must be 
richer and more flexible than the flatfile format, formally correct as a 
language, yet not force a particular hardware platform, programming 
language, or database technology on the scientific community.

    2) a data specification which rigorously defines core objects (such 
as sequences, maps, coding regions) yet allows both the addition of 
custom extensions to existing defined objects and the creation of 
totally new objects.  A migration path must exist for moving the user 
defined objects to the core standard set as certain definitions prove to 
be of widespread utility by dint of experience.

    3) stable identifiers for sequences must be supported by central 
databases.  The somewhat casual relationship of LOCUS and ACCESSION with 
a particular sequence is intolerable if other investigators build 
databases which cite locations on these sequences.  The ability to 
stably cite features is even more complex (see discussion below).

    4) the data model must allow incorporation of data of various types 
from different sources.  The same data must be able to participate in 
different views of the database (eg.  a "typical" beta globin region vs. 
all original pieces of sequence containing a beta globin coding region 
in the database).

NCBI Approach to Problem 1:

    We have chosen a data exchange language called ASN.1, Abstract 
Syntax Notation One.  It is a standard of the International 
Organization for Standardization (ISO 8824, 8825) for exchange of 
structured data in a formal, yet machine and implementation 
independent way.  This is not another ad 
hoc file format invented for a special purpose by biologists.  It 
separates the definition of the data structure from any particular block 
of data.  This means that the specification is necessary and sufficient 
to describe data conforming to it from any source.  The specification is 
not a passive documentation of a file format, but is used by software to 
actively check a data stream for accuracy.  Anyone who has been parsing 
flatfiles will appreciate the value of full, automatic data checking by 
machine.
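
    As a rough illustration of what "specification separate from data, 
used actively for checking" means, here is a minimal Python sketch.  It 
is not ASN.1 and not NCBI software; the spec layout, CITATION_SPEC, and 
validate() are all hypothetical names made up for this example.

```python
# Hypothetical mini-specification, kept separate from any data:
# field name -> (expected Python type, required?).
# This only mimics the idea of a machine-checkable spec; it is not ASN.1.
CITATION_SPEC = {
    "title": (str, True),
    "authors": (list, True),
    "year": (int, True),
    "journal": (str, False),
}


def validate(record, spec):
    """Return a list of problems found when checking record against spec."""
    problems = []
    for name, (expected_type, required) in spec.items():
        if name not in record:
            if required:
                problems.append(f"missing required field: {name}")
            continue
        value = record[name]
        if not isinstance(value, expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    # Anything the spec does not define is reported rather than ignored.
    for name in record:
        if name not in spec:
            problems.append(f"unknown field: {name}")
    return problems


good = {"title": "Beta globin", "authors": ["A. Author"], "year": 1992}
bad = {"title": "Beta globin", "authors": "A. Author", "yr": 1992}

for label, record in (("good", good), ("bad", bad)):
    print(label, validate(record, CITATION_SPEC))
```

    The point of the sketch is only that the same spec object checks 
data from any source, which a hand-written flatfile parser cannot do.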

    ASN.1 supports modular specification.  That is, one may have a 
module specifying bibliographic entities.  This module can then be 
simply referenced by other modules, such as a sequence module or a 
MEDLINE module, rather than coming up with a new bibliographic component 
for every new database.  Like modular programming, modular data 
specification has profound benefits for data and code reusability and 
maintainability.  The modular design also greatly facilitates linkage 
between databases because they may differ in overall content, but share 
certain defined entities such as literature citations or sequence 
identifiers, which will be compatible with each other, and thus provide 
an avenue for automatic linkage of the other data elements.
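
    The linkage idea can be sketched in a few lines of Python.  This is 
an assumption-laden toy, not the NCBI specification: the Citation, 
SequenceEntry, and MedlineEntry classes, their fields, and link() are 
hypothetical, and the accession and uid values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Stand-in for a shared bibliographic module (hypothetical fields)."""
    journal: str
    year: int
    first_page: int


@dataclass
class SequenceEntry:
    """A record from a sequence database that reuses Citation."""
    accession: str
    cited_in: Citation


@dataclass
class MedlineEntry:
    """A record from a literature database that reuses Citation."""
    uid: int
    citation: Citation


def link(sequences, abstracts):
    """Pair sequence entries with literature entries sharing a Citation."""
    by_citation = {m.citation: m for m in abstracts}
    return [
        (s.accession, by_citation[s.cited_in].uid)
        for s in sequences
        if s.cited_in in by_citation
    ]


paper = Citation("J Mol Biol", 1990, 403)
sequences = [
    SequenceEntry("X12345", paper),
    SequenceEntry("X99999", Citation("Cell", 1991, 1)),
]
abstracts = [MedlineEntry(1, Citation("J Mol Biol", 1990, 403))]
print(link(sequences, abstracts))
```

    Because both record types share one definition of a citation, the 
link falls out of a simple equality test rather than fuzzy matching.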

    In order to facilitate use of ASN.1, NCBI provides software tools 
for developing specifications, validating them, automatically generating 
parsers for any specification, and tools for reading and writing ASN.1 
structured data that run on 14 different hardware and software 
platforms.  See below for tool availability.

NCBI Approach to Problem 2:

    We have done specifications in ASN.1 for biological sequences, 
including nucleic acids, proteins, and maps of various types.  We have 
an extensive specification for bibliographic information, including 
articles, journals, books, theses, manuscripts, patents, etc., which 
conforms to the ANSI standard for bibliographic citations.  We have a 
specification for MEDLINE.  We have specifications for a variety of 
features, for alignments of sequences, and for graphs of sequence 
properties.

   The specification has been tested by mapping all of GenBank, EMBL, 
DDBJ, SWISSPROT, PIR, and PRF into ASN.1 conforming to the spec.  We 
have also mapped MEDLINE, and the sequences from the Brookhaven 
structural database, among other things.  Thus the specification is a 
superset and a unification of most major existing sources of sequences 
and their annotations.  Much of this has been appearing on the Entrez disks 
for some time.  More will appear over the year.

    In addition to integrating the sequence databases themselves into a 
single entity, we have also been addressing the issue of contributed 
information ABOUT the sequences.  We worked with Philipp Bucher, author 
of the Eukaryotic Promoter Database (EPD) on the TxInit (transcription 
initiation) feature definition.  He produces EPD as an ASN.1 formatted 
feature table on every release of EPD.

    The ASN.1 specification allows a sequence to have multiple feature 
tables on the same entry, with attribution to the source.  So we will be 
adding EPD information automatically to the sequence data appearing in 
our ASN.1 releases in the near future.  This allows a very rich 
annotation to be provided by a specialist on their own local system, but 
to be automatically presented in a user view as an integrated part of 
the database.  It is our plan to expand this aspect in a big way once we 
have stabilized the sequence data itself (remember we have had 
responsibility for GenBank for only a couple of months).

    The ASN.1 spec supports a "User-defined Object".  This allows 
structured data defined by the user to be attached either to an existing 
feature as an extension (e.g. a CdRegion with an extension carrying more 
information about the translation process), or as a completely new 
feature type when something is so new it is more than an extension to an 
existing type.  User-defined types are transparent in ASN.1, yet support 
a completely structured datatype that user code can operate on.  It 
provides an unmoderated forum for new ideas, which code can ignore or 
take advantage of.  If a user defined type becomes popular or important, 
then there already exists a definition for it and possible pre-existing 
data in the database which could be reliably converted to a new standard 
type.
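
    A minimal Python sketch of the extension idea follows.  It is not 
the ASN.1 definition; UserObject, Feature, describe(), and the 
"translation-notes" and "lab-x-notes" labels are hypothetical names 
chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class UserObject:
    """Structured data whose layout is chosen by the contributor."""
    type_name: str
    data: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Feature:
    """A core feature with an optional user-defined extension."""
    kind: str
    start: int
    stop: int
    ext: Optional[UserObject] = None


def describe(feature):
    """Use the one extension this code understands; ignore any other."""
    text = f"{feature.kind} {feature.start}..{feature.stop}"
    ext = feature.ext
    if ext is not None and ext.type_name == "translation-notes":
        text += f" (frameshift at {ext.data.get('frameshift_at')})"
    return text


plain = Feature("CdRegion", 100, 400)
known = Feature("CdRegion", 100, 400,
                UserObject("translation-notes", {"frameshift_at": 250}))
other = Feature("CdRegion", 100, 400,
                UserObject("lab-x-notes", {"comment": "unreviewed"}))

for f in (plain, known, other):
    print(describe(f))
```

    Software that does not recognize an extension still reads the core 
feature correctly, which is what lets new ideas ride along unmoderated.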

NCBI Approach to Problem 3:

    In order for outside scientists to cite a sequence location and then 
compare it with other data at a later time, the database must provide 
stable identifiers for sequences.  It must be understood that GenBank 
itself is an international collaboration, and, in addition, we are 
adding other sequence data not traditionally part of GenBank such as 
proteins.  This means NCBI cannot simply stabilize sequence ids by fiat.

    However, we are building a database called ID, whose job it is to 
impose stable IDs, called GI (GenInfo) numbers.  A GI is an arbitrary 
unsigned integer which identifies a specific sequence.  If anything in 
the sequence changes (a 1 bp change is enough) it is assigned a new GI.

    Bioseqs, in the ASN.1 definition, can have multiple ids.  So, an 
entry that comes from EMBL, say X12345, would enter ID the first time 
and be assigned a GI, say 10.  In the ASN.1 form of that Bioseq, it 
would have both ids, EMBL X12345, and GI 10.  All feature locations, 
etc., would be converted from an EMBL id to GI 10 on input.  Then 
suppose ID gets an update, which is only to the feature table of X12345.  
ID looks up the old X12345, compares the old sequence to the new, and, 
since they are identical, gives the new entry GI 10 again.  Now, suppose 
the sequence for X12345 is changed, but the accession stays the same.  
When ID compares the new sequence with the old, it sees it changed, so 
it assigns a new GI, say 15.  It also adds a history to the ASN.1 form 
of the entry.  GI 15 gets a pointer saying that it used to be GI 10.  GI 
10 gets a pointer that says it has been replaced by GI 15.  A release of 
Entrez made now would only have the GI 15 entry.  ID would still 
contain both GI 10 and GI 15, however.

   A feature submitted that cited GI 10 could reliably be integrated 
into a release on the fly.  When GI 10 is replaced by GI 15, the 
contributor of the feature citing GI 10 could be notified that their 
entry may be invalid and to look at GI 15.  They could confirm that 
their annotation in fact still applies to GI 15 and resubmit.

   When you retrieve from ID based on accession X12345, you get the 
latest entry with that accession, GI 15.  However, if you retrieve with 
GI 10, you get the original GI 10 entry, plus the additional information 
that it now has been replaced by GI 15.  Thus ID can provide a data 
system which can operate both on the old unstable ID system as well as 
impose a new, parallel stable id system.
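
    The GI bookkeeping described above can be sketched as a toy Python 
class.  This is only an illustration under stated assumptions, not the 
actual ID database: IdRegistry and its method names are hypothetical, 
and GIs are handed out sequentially from 1 rather than as 10 and 15.

```python
class IdRegistry:
    """Toy model of ID: assigns GIs and keeps a replacement history."""

    def __init__(self):
        self.next_gi = 1
        self.entries = {}   # gi -> entry dict
        self.latest = {}    # accession -> most recent gi

    def submit(self, accession, sequence, features=()):
        """Store an entry; reuse the GI unless the sequence changed."""
        old_gi = self.latest.get(accession)
        if old_gi is not None and self.entries[old_gi]["sequence"] == sequence:
            # Only the annotation changed, so the GI stays the same.
            self.entries[old_gi]["features"] = list(features)
            return old_gi
        gi = self.next_gi
        self.next_gi += 1
        self.entries[gi] = {
            "accession": accession,
            "sequence": sequence,
            "features": list(features),
            "replaces": old_gi,      # pointer back to the old GI, if any
            "replaced_by": None,
        }
        if old_gi is not None:
            # The old entry is kept and gains a forward pointer.
            self.entries[old_gi]["replaced_by"] = gi
        self.latest[accession] = gi
        return gi

    def by_accession(self, accession):
        """Return the GI of the latest entry for an accession."""
        return self.latest[accession]

    def by_gi(self, gi):
        """Return the exact entry for a GI, including history pointers."""
        return self.entries[gi]


registry = IdRegistry()
first = registry.submit("X12345", "ACGT")
same = registry.submit("X12345", "ACGT", ["promoter"])   # feature update
changed = registry.submit("X12345", "ACGA")              # 1 bp change
print(first, same, changed, registry.by_gi(first)["replaced_by"])
```

    Retrieval by accession follows the moving target, while retrieval 
by GI always returns the exact sequence that was originally cited.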

NCBI Approach to Problem 4:

    In addition to features, the ASN.1 specification supports alignments 
and sequences constructed by assembly of other sequences.  This allows 
submission by outside sources of sequence merge information, placement 
of sequences on genetic and physical maps, assemblies of published 
sequences under a "prototypical" representative, and so on.  These 
constructs can be used both to provide new insights on sequence 
relationships as well as for an author to provide a history of changes 
to a sequence as it is updated or added to.  We are doing a prototype 
project of this sort with Kenn Rudd's E. coli database and 
with Elvin Kabat's Proteins of Immunological Interest.  Already in the 
data released in Entrez, we have assembled segmented sequences from 
GenBank into such higher level entities with pointers to their 
components.
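
    One way to picture a higher level entity with pointers to its 
components is the following Python sketch.  It is an assumption, not the 
ASN.1 structure: the SEQUENCES store, the GI numbers, the residues, and 
the 0-based half-open coordinates are all invented for the example.

```python
# Component sequences keyed by a hypothetical GI number.
SEQUENCES = {
    10: "ATGGTGCACCTG",
    11: "ACTCCTGAGGAG",
    12: "AAGTCTGCCGTT",
}

# The segmented sequence holds no residues of its own, only pointers:
# (gi, start, stop), using 0-based half-open coordinates in this sketch.
SEGMENTED = [(10, 0, 6), (11, 3, 9), (12, 0, 12)]


def resolve(segments, store):
    """Build the assembled sequence by following each pointer in order."""
    return "".join(store[gi][start:stop] for gi, start, stop in segments)


def components(segments):
    """List the GIs an assembled sequence depends on, in order."""
    return [gi for gi, _start, _stop in segments]


print(resolve(SEGMENTED, SEQUENCES))
print(components(SEGMENTED))
```

    Since the assembly stores only pointers, stable identifiers for the 
components are what keep such a construct valid over time.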

    Another type of assembly allowed by the ASN.1 specification is the 
grouping of related sequences together.  In the current En


