In view of recent discussions on the biosci bulletin-boards about changes
to database formats, I am posting here text extracted from the release notes
of the upcoming release of the EMBL Nucleotide Sequence Database (Rel 24,
Aug 90).
Peter Stoehr
EMBL Data Library
-----------------------------------------------------------------------------
1 CHANGES AT THIS RELEASE (Release 24, August 1990)
1.1 New Feature Table
Experience in trying to represent some of the more complex features of
nucleotide sequences led both ourselves and GenBank to the conclusion that the
old style of feature table was inadequate. EMBL, GenBank and the DNA Data Bank
of Japan have completed the design of a new, common, feature table format which
we are introducing at this release.
If you would like to receive details of the new feature table format then please
contact us (by post, telephone or electronic mail) at the address shown on the
cover page of this document.
A brief introduction to the new format is supplied as the file FTABLE.DOC on the
release tape.
1.2 New DR (Database Cross-Reference) Line
This new line type cross-references other databases which contain information
related to entries in the EMBL nucleotide sequence database.
For example, if the protein translation of a sequence exists in the SWISS-PROT
or PIR databases there will be DR lines pointing to the relevant SWISS-PROT or
PIR entries. If the atomic coordinates of these SWISS-PROT or PIR entries are
stored in the Brookhaven Protein Data Bank (PDB) there will be DR line(s)
pointing to the corresponding entry(ies) in that data bank.
The format of the DR line is as follows:
DR database_identifier; primary_identifier; secondary_identifier.
The first item on the DR line, the database identifier, is the abbreviated name
of the data collection to which reference is made. The initial set of
cross-referenced databases is:
Database ID   Full name
-----------   --------------------------------------------------------
HIV           The HIV Sequence Database
PDB           The Brookhaven Protein Data Bank (PDB)
PIR           The Protein Sequence Database of the Protein
              Identification Resource (PIR)
SWISS-PROT    The SWISS-PROT Protein Sequence Database
The second item on the DR line, the primary identifier, is a pointer to the
entry in the external database to which reference is being made. The data item
used as the primary identifier depends on the database being referenced:
Database ID   Primary Identifier
-----------   ------------------
HIV           Accession number
PDB           Entryname
PIR           Accession number
SWISS-PROT    Accession number
The third item on the DR line, the secondary identifier, is used to complement
the information given by the primary identifier. Again, the data item used
depends on the database being referenced:
Database ID   Secondary Identifier
-----------   ----------------------------------------------
HIV           Entryname
PDB           Most recent revision date (last REVDAT record)
PIR           Entryname
SWISS-PROT    Entryname
Some examples of complete DR lines are shown below:
DR HIV; K02013; NEF$BRU.
DR PDB; 3ADK; 16-APR-88.
DR PIR; A02768; R5EC7.
DR SWISS-PROT; P03593; V90K$AMV.
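For readers who want to adapt their software, the DR line layout described above can be split into its three items with a short routine. The following Python sketch is only an illustration and not part of the release: the names CrossRef and parse_dr_line are hypothetical, and it assumes that the items are separated by semicolons and that the line ends with a full stop, as in the examples above.

```python
import re
from typing import NamedTuple


class CrossRef(NamedTuple):
    """The three items of a DR line (hypothetical container name)."""
    database: str
    primary: str
    secondary: str


# Assumed layout: "DR", whitespace, then three ';'-separated items
# terminated by a full stop.
DR_PATTERN = re.compile(r"^DR\s+([^;]+);\s*([^;]+);\s*(.+?)\.\s*$")


def parse_dr_line(line: str) -> CrossRef:
    """Split one DR line into database, primary and secondary identifier."""
    match = DR_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a DR line: {line!r}")
    return CrossRef(*(item.strip() for item in match.groups()))


if __name__ == "__main__":
    print(parse_dr_line("DR PDB; 3ADK; 16-APR-88."))
```

The meaning of the second and third items still has to be looked up by database identifier, as set out in the tables above; the sketch makes no attempt to validate them.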
2 FORTHCOMING CHANGES
2.1 RN Line Format
Each reference block in a database entry currently contains exactly one RN line
which represents three different pieces of information: the number of the
reference within the entry, the base span(s) covered by the reference, and an
optional comment. The RN line is formatted as follows:
RN [n] (bases i-j, k-l, m-n, ...) comment
The restriction to one RN line per reference block imposes an arbitrary limit on
the number of base spans which can be specified for a reference, and in order to
remove this restriction we will change the RN line format at the next quarterly
release (i.e. Release 25 in November 1990).
The current RN line will be replaced by three line types: a modified RN
(Reference Number) line type containing just the reference number, a new RC
(Reference Comment) line type containing just the reference comment, and a new
RB (Reference Base) line type containing just the base spans covered by the
reference.
RN [n]
RC comment
RB i-j, k-l, m-n, ...
Each reference block will continue to have exactly one RN line. As many RC
lines as are needed to display the reference's comment will appear. If a
reference has no comment then the RC line will not appear. As many RB lines as
are needed to display the reference's base spans will appear. If a reference
has no base spans then the RB line will not appear.
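As an illustration of how the two layouts relate, the Python sketch below rewrites a current-style RN line as the planned RN, RC and RB lines. It is a rough sketch under stated assumptions, not our conversion software: the function name convert_rn_line is hypothetical, the three-space gap after the line code is assumed, the base spans are assumed to be introduced by the word "bases" inside parentheses, and no attempt is made to wrap long comments or span lists over several RC or RB lines.

```python
import re
from typing import List

# Assumed current layout: RN [n] (bases i-j, k-l, ...) comment
# The base spans and the comment are both optional.
OLD_RN_PATTERN = re.compile(
    r"^RN\s+\[(\d+)\]\s*(?:\(bases\s+([^)]*)\))?\s*(.*)$"
)


def convert_rn_line(line: str) -> List[str]:
    """Return the new-style RN/RC/RB lines for one old-style RN line."""
    match = OLD_RN_PATTERN.match(line.rstrip())
    if match is None:
        raise ValueError(f"not an RN line: {line!r}")
    number, spans, comment = match.groups()

    lines = [f"RN   [{number}]"]
    if comment:
        # RC appears only when the reference has a comment.
        lines.append(f"RC   {comment}")
    if spans:
        # RB appears only when the reference has base spans.
        tidy = ", ".join(span.strip() for span in spans.split(","))
        lines.append(f"RB   {tidy}")
    return lines


if __name__ == "__main__":
    for out in convert_rn_line("RN   [2]  (bases 1-120, 300-450) revised"):
        print(out)
```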
2.2 DT Line Format
We have decided to change the information we supply on DT lines, in order to
satisfy two of the most common requests for enhancements we receive: to provide
an easy way of determining when an entry first appeared in the database and when
it was last updated.
As from the next quarterly release (i.e. Release 25 in November 1990) each
database entry will contain exactly two DT lines, which will indicate when the
entry first appeared in the database and when it was last updated. Each entry
will also receive a version number, which will be incremented by one every time
the entry is updated. The DT lines will be formatted as follows:
DT DD-MMM-YYYY (Rel. #; Last updated; Version #)
DT DD-MMM-YYYY (Rel. #; Created)
For example:
DT 12-APR-1990 (Rel. 23; Last updated; Version 3)
DT 10-MAR-1990 (Rel. 22; Created)
Note that the format of the DT line is unchanged (i.e. a DD-MMM-YYYY date
followed by parenthesised text); what we have done is to rigorously specify the
text which appears in parentheses after the date.
The version number will only appear on the "Last updated" DT line. If an entry
has not been updated since it was created, it will still have two DT lines and
the "Last updated" line will have the same date (and release number) as the
"Created" line. The date supplied on each DT line indicates when the entry was
created or updated; that will usually also be the date when the new or modified
entry became publicly visible via our file server. The release number
indicates the first quarterly release made *after* the entry was created or last
updated.
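Because the parenthesised text is now fixed, the new DT lines can be read mechanically. The Python sketch below shows one way this might be done; the names DtInfo and parse_dt_line are hypothetical, and the exact spacing inside the parentheses is assumed from the examples above.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Month abbreviations are mapped by hand so that the result does not
# depend on the locale of the machine running the code.
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}


class DtInfo(NamedTuple):
    """Contents of one DT line (hypothetical container name)."""
    when: date
    release: int
    kind: str                # "Created" or "Last updated"
    version: Optional[int]   # present only on the "Last updated" line


DT_PATTERN = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4})\s+"
    r"\(Rel\.\s*(\d+);\s*(Created|Last updated)"
    r"(?:;\s*Version\s+(\d+))?\)\s*$"
)


def parse_dt_line(line: str) -> DtInfo:
    """Parse a new-style DT line into date, release, kind and version."""
    match = DT_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a new-style DT line: {line!r}")
    day, month, year, release, kind, version = match.groups()
    if month not in MONTHS:
        raise ValueError(f"unknown month abbreviation: {month!r}")
    # The version number belongs on the "Last updated" line and only there.
    if (kind == "Last updated") != (version is not None):
        raise ValueError(f"unexpected version field: {line!r}")
    return DtInfo(
        when=date(int(year), MONTHS[month], int(day)),
        release=int(release),
        kind=kind,
        version=int(version) if version is not None else None,
    )


if __name__ == "__main__":
    print(parse_dt_line("DT 12-APR-1990 (Rel. 23; Last updated; Version 3)"))
    print(parse_dt_line("DT 10-MAR-1990 (Rel. 22; Created)"))
```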
2.3 Lowercase Sequences
The EMBL Data Library and GenBank, along with many other groups who deal
extensively with sequence data, have long noted that the presentation of
sequences using lowercase letters significantly improves the accuracy of human
readers who have to deal with them. Since the use of lowercase letters is now
allowed in the IUPAC-IUB standard, we will switch to a lowercase presentation of
sequences as from the next quarterly release (i.e. Release 25 in November 1990).
2.4 Taxonomic Information
We will make the following changes to the way in which taxonomic information is
represented in the database as from Release 26 in February 1991.
2.4.1 New OG (Organelle) Line
A new linetype will be introduced, to indicate the location of non-nuclear
sequences. It will only be present in entries containing non-nuclear sequences
and will appear after the last OC line in such entries.
The OG line will contain one data item: "Mitochondrion", "Chloroplast",
"Kinetoplast" or a plasmid name (e.g. "Plasmid pBR322").
OS lines of non-nuclear entries will no longer be prefixed by "Mitochondrion",
"Chloroplast" or "Kinetoplast"; this information will only appear on the OG
line. We will also abandon the use of separate taxonomic trees for
chloroplastida and mitochondria.
For example, the current:
OS Chloroplast Euglena gracilis (green algae)
OC Chloroplastida; Planta; Phycophyta; Euglenophyceae.
will become:
OS Euglena gracilis (green algae)
OC Eukaryota; Planta; Phycophyta; Euglenophyceae.
XX
OG Chloroplast
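Software that holds current-style entries may need to separate the organelle prefix from the organism name. The Python sketch below illustrates the idea for the text of an OS line; the function name split_organelle is hypothetical, and the sketch deals only with the OS prefix, not with the accompanying change to the OC classification.

```python
from typing import Optional, Tuple

# The three organelle names that currently prefix OS lines.
ORGANELLES = ("Mitochondrion", "Chloroplast", "Kinetoplast")


def split_organelle(os_text: str) -> Tuple[str, Optional[str]]:
    """Return (organism text, organelle) for the text of an OS line.

    The organelle is None when the OS text carries no organelle prefix,
    in which case no OG line would be written.
    """
    for organelle in ORGANELLES:
        prefix = organelle + " "
        if os_text.startswith(prefix):
            return os_text[len(prefix):], organelle
    return os_text, None


if __name__ == "__main__":
    print(split_organelle("Chloroplast Euglena gracilis (green algae)"))
```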
2.4.2 Hybrids
Hybrids will be handled by repeating the OS/OC lines for each source organism in
the hybrid. A human/mouse hybrid, for example, will appear as follows:
OS Homo sapiens (human)
OC ... OC for humans ...
XX
OS Mus musculus (mouse)
OC ... OC for mice ...
2.4.3 Unknown Sources
In cases where the source organism is unknown, the taxonomy on the OC line(s)
will be as specific as possible and the OS line will be "OS Unknown". For
example:
OS Unknown
OC Prokaryota; Bacteria.
2.4.4 Artificial Sequences
A new taxonomic node, "Artificial sequences", will be introduced at the same
level as "Prokaryota", "Eukaryota", etc. It will have (at least initially) two
child nodes: "Cloning vectors" and "Synthetic genes".
2.4.5 Plasmids
For naturally occurring plasmids the OS/OC lines will contain the source
organism and the plasmid name will appear on an OG line. For example:
OS Escherichia coli
OC Prokaryota; ... Enterobacteriaceae.
XX
OG Plasmid colE1
For artificial plasmids the OS line will be "OS None" and the sequence will be
classified as a cloning vector. The plasmid name will appear on an OG line.
For example:
OS None
OC Artificial sequences; Cloning vectors.
XX
OG Plasmid pBR322
Where only a naturally occurring part of a plasmid is reported, the plasmid name
will appear on the OG line and the OS/OC lines will describe the natural source.
For example:
OS Escherichia coli
OC Prokaryota; ... Enterobacteriaceae.
XX
OG Plasmid pUC8