Release 53 of the EMBL Nucleotide Sequence Database is now available from the
EMBL-EBI ftp site: ftp://ftp.ebi.ac.uk/pub/databases/embl/release
This release is already installed on the other EBI query/search servers
(SRS, Blast and Fasta). A few extracts from the release notes follow; the
notes can be read in full at:
ftp://ftp.ebi.ac.uk/pub/databases/embl/release/relnotes.doc or
http://www.ebi.ac.uk/ebi_docs/embl_db/relnotes53/relnotes.html
For those who maintain local copies using our cumulative update file
ftp://ftp.ebi.ac.uk/pub/databases/embl/new/cumulative.dat.Z, please note that
on Friday Jan 9th at 15:00 GMT this file will be changed to include only
post-release 53 data (i.e. it will become smaller at that point).
Regards,
Peter Stoehr
EMBL-EBI
--------
1 RELEASE 53
The EMBL Nucleotide Sequence Database was frozen to make Release 53 on the 16th
December 1997. The release contains 1,917,868 sequence entries comprising
1,281,391,651 nucleotides. This represents an increase of about 8% over Release
52. A breakdown of Release 53 by division is shown below:
Division            Entries   Nucleotides
-----------------  --------  ------------
Bacteriophage          1388       2188305
ESTs                1343796     496603984
Fungi                 18137      44602064
GSSs                 100154      49099107
HTG                    1868     102763872
Human                 74384     139022655
Invertebrates         28126     107524431
Organelles            24715      22870076
Other Mammals         14429      13785092
Other Vertebrates     13145      14653255
Plants                22136      37736590
Patent                91221      29511807
Prokaryotes           42666     102750354
Rodents               37043      46489741
STSs                  51172      17685717
Synthetic              2424       5377292
Unclassified           2380       2387088
Viruses               48684      46340221
-----------------  --------  ------------
Total               1917868    1281391651
-----------------  --------  ------------
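
The breakdown can be checked mechanically. The following is a minimal Python
sketch, not part of the release notes, that re-adds the per-division figures
and compares them with the stated totals. The variable names are illustrative
only.

```python
# Per-division figures copied from the Release 53 table: (entries, nucleotides).
DIVISIONS = {
    "Bacteriophage": (1388, 2188305),
    "ESTs": (1343796, 496603984),
    "Fungi": (18137, 44602064),
    "GSSs": (100154, 49099107),
    "HTG": (1868, 102763872),
    "Human": (74384, 139022655),
    "Invertebrates": (28126, 107524431),
    "Organelles": (24715, 22870076),
    "Other Mammals": (14429, 13785092),
    "Other Vertebrates": (13145, 14653255),
    "Plants": (22136, 37736590),
    "Patent": (91221, 29511807),
    "Prokaryotes": (42666, 102750354),
    "Rodents": (37043, 46489741),
    "STSs": (51172, 17685717),
    "Synthetic": (2424, 5377292),
    "Unclassified": (2380, 2387088),
    "Viruses": (48684, 46340221),
}

# Totals as stated in the release notes.
STATED_TOTAL = (1917868, 1281391651)

total_entries = sum(entries for entries, _ in DIVISIONS.values())
total_nucleotides = sum(nucleotides for _, nucleotides in DIVISIONS.values())

# Share of all entries that fall in the EST division.
est_share = DIVISIONS["ESTs"][0] / total_entries

print(total_entries, total_nucleotides)
print(f"ESTs account for {est_share:.0%} of entries")
```

Both sums agree with the totals given in the table.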
1.1 EST Database Files
In order to keep the size of the data files within reasonable limits for
handling purposes, we have split the EST division into several files. At this
release we have created one extra file of EST data, named EST14.DAT. Additional
files will be added in subsequent releases as appropriate.
1.2 Feature Table Qualifiers
1.2.1 New SOURCE Qualifier /specimen_voucher
This new source feature qualifier, valid as of this release, indicates the
source of a sample of the sequenced material, e.g. a museum identification tag.
Qualifier      /specimen_voucher="text"
Definition     an identifier of the individual or collection of the source
               organism and the place where it is currently stored, usually
               an institution.
Value format   "text"
Example        /specimen_voucher="Smith s. n. 4-IV-1995 (U. S. Natl.
               Herbarium)"
1.2.2 New /focus
This new source feature qualifier, valid as of this release, defines the main
source feature for records with more than one source feature (e.g.
proviral/cellular sequences).
Qualifier      /focus
Definition     defines the preferred source feature for records that
               have more than one source feature
Value format   none
Example        /focus
Comment        this qualifier is to be used only if there is more than
               one source feature. The preferred source feature is
               used to determine which organism is displayed in
               the SOURCE and ORGANISM lines and to determine the
               EMBL division in which it is placed.
For sequences derived from more than one organism, and therefore containing
more than one 'source' feature key, the /focus qualifier will be attached to
the source key which represents the major organism, i.e. the one that was the
focus of the sequencing effort. If no translation table is specified, the
organism with /focus will define the translation table.
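
As an illustration only, the selection rule described above might be sketched
in Python as follows. The record layout (a list of dictionaries with a
"qualifiers" mapping), the function name and the organism names are all
hypothetical; this is not EBI software and does not parse the flat-file format.

```python
def pick_focus_source(sources):
    """Return the preferred source feature, or None if it cannot be decided.

    With a single source feature there is nothing to choose. With several,
    the one carrying the valueless /focus qualifier is preferred.
    """
    if len(sources) == 1:
        return sources[0]
    focused = [s for s in sources if "focus" in s["qualifiers"]]
    # Exactly one /focus is expected; anything else is left undecided here.
    return focused[0] if len(focused) == 1 else None


# Made-up example of a proviral/cellular record with two source features.
record_sources = [
    {"location": "1..500",
     "qualifiers": {"organism": "Homo sapiens"}},
    {"location": "501..900",
     "qualifiers": {"organism": "Human immunodeficiency virus 1",
                    "focus": None}},  # /focus has no value
]

preferred = pick_focus_source(record_sources)
print(preferred["qualifiers"]["organism"])
```

The organism of the returned feature is the one that would appear in the
organism lines and decide the division of the entry.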
2 FORTHCOMING CHANGES
2.1 Nucleotide And Protein Identifiers
2.1.1 Nucleotide Identifiers
The NI linetype of the EMBL flat-file format currently contains a unique
identifier for the nucleotide sequence. While the sequence remains the same, so
does the value of this identifier. When a sequence change occurs, however
minor, a new NI value will be assigned whilst the accession number on the AC
line may remain unchanged. These identifiers are collaboratively maintained
with GenBank and DDBJ, for example:
NI g21954
Feedback from users and other database groups has made it clear that there is
confusion about the relationship between these identifiers and the GenBank 'gi'
numbers.
It has therefore been decided to introduce a new system of nucleotide
identifiers of the form 'accession.version', e.g. X12345.3, where the accession
number part will be stable, but the version part will be incremented if the
sequence changes.
Subject to synchronisation of this change with GenBank and DDBJ, we plan to
implement this new form of nucleotide identifier during 1998.
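
For readers who handle identifiers in local scripts, the following Python
sketch shows one way the proposed 'accession.version' form might be split and
compared. It assumes only what is stated above (a stable accession, a dot, and
an integer version). The function names are hypothetical, and the final format
remains subject to the synchronisation mentioned above.

```python
def split_identifier(identifier):
    """Split 'accession.version' into (accession, version as int)."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not accession or not version.isdecimal():
        raise ValueError(f"not of the form accession.version: {identifier!r}")
    return accession, int(version)


def sequence_changed(old_identifier, new_identifier):
    """True if both identifiers name the same entry but different versions."""
    old_acc, old_ver = split_identifier(old_identifier)
    new_acc, new_ver = split_identifier(new_identifier)
    return old_acc == new_acc and old_ver != new_ver


print(split_identifier("X12345.3"))
print(sequence_changed("X12345.3", "X12345.4"))
```

A local mirror could use a comparison of this kind to decide whether a stored
sequence needs to be refreshed while keeping the accession number as its key.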
2.1.2 Protein Identifiers
Protein identifiers are currently assigned to all CDS features in the nucleotide
sequence database and are found in the feature table qualifier /db_xref, eg:
/db_xref="PID:e123456789"
As with the nucleotide identifiers (above), users have been confused about the
relationship of these PIDs to the GenBank 'gi' numbers that are also assigned
to CDS features. To clarify this, and to adopt a comparable scheme of
identifiers for both nucleotides and proteins, the collaborating databases have
decided to create a new feature table qualifier /protein_id, e.g.:
/protein_id="AAA12345.1"
This form of identifier also makes it easier for external databases to track
changing protein identifiers than was possible with the previous PIDs.
The qualifier value consists of a stable ID portion (3+5 format: 3 letters
followed by 5 digits) plus a version number after a decimal point. The version
number will change only when the protein sequence coded by the CDS changes,
while the stable part will remain unchanged. This qualifier will be valid only
on CDS features which translate into a valid protein.
Subject to synchronisation of this change with GenBank and DDBJ, we plan to
implement this new form of protein identifier during 1998.
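
A format check for the proposed /protein_id values could look like the Python
sketch below. It encodes the "3 letters, 5 digits, decimal point, version"
description given above. Treating the letters as upper case is an assumption
based on the example AAA12345.1, and the function name is hypothetical.

```python
import re

# Stable part: 3 letters + 5 digits; then a dot and a numeric version.
# Upper-case letters are assumed from the example in the text.
PROTEIN_ID = re.compile(r"([A-Z]{3}[0-9]{5})\.([0-9]+)")


def parse_protein_id(value):
    """Return (stable_id, version) or None if the value does not fit."""
    match = PROTEIN_ID.fullmatch(value)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_protein_id("AAA12345.1"))
print(parse_protein_id("e123456789"))  # an old-style PID value does not fit
```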
2.2 Feature Table Qualifiers
2.2.1 New /protein_id
As mentioned above, a new feature table qualifier /protein_id will be created:
Qualifier      /protein_id="<identifier>"
Definition     Protein identifier, issued by the international collaborators.
               The value consists of a stable ID portion (3+5 format:
               3 letters followed by 5 digits) plus a version number
               after a decimal point.
Example        /protein_id="AAA12345.1"
Comment        The version number will change only when the protein sequence
               coded by the CDS changes, while the stable part will remain
               unchanged. This qualifier is valid only on CDS features which
               translate into a valid protein. The list of 3-letter prefixes
               will be maintained by EBI.
Subject to synchronisation of this change with GenBank and DDBJ, we plan to
implement this new form of protein identifier during 1998.
2.2.2 /translation And Related Feature Qualifiers
The collaborating databases DDBJ/EMBL/GenBank have decided that
translation-related qualifiers should be used only with the primary CDS
feature key. These translation-related qualifiers are:
/codon
/codon_start
/exception
/translation
/transl_table
/transl_except
Starting with Release 54, translation-related qualifiers will be valid only
with the CDS feature key and will be removed from the following non-CDS
features:
C_region
D_segment
exon
J_segment
mat_peptide
N_region
sig_peptide
S_region
transit_peptide
V_region
V_segment
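
Users who post-process feature tables locally may wish to anticipate this
change. The Python sketch below drops the six translation-related qualifiers
from any feature whose key is not CDS. It assumes qualifiers are held in a
plain dictionary keyed by qualifier name (without the leading slash), which is
an illustrative representation and not a parser for the flat-file format.

```python
# The six translation-related qualifiers listed above, without the slash.
TRANSLATION_QUALIFIERS = {
    "codon", "codon_start", "exception",
    "translation", "transl_table", "transl_except",
}


def strip_translation_qualifiers(feature_key, qualifiers):
    """Return a copy of qualifiers that is valid under the Release 54 rule."""
    if feature_key == "CDS":
        return dict(qualifiers)  # CDS keeps everything
    return {name: value for name, value in qualifiers.items()
            if name not in TRANSLATION_QUALIFIERS}


# Made-up qualifier values for illustration.
exon = {"codon_start": "1", "gene": "abc", "number": "2"}
print(strip_translation_qualifiers("exon", exon))
print(strip_translation_qualifiers("CDS", {"codon_start": "1"}))
```

Because the qualifiers will be valid only with CDS, the sketch filters every
non-CDS key and not just the eleven listed above.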