Release 52 of the EMBL Nucleotide Sequence Database is available from the
EBI FTP server.
FTP server      Directory                         Format
-------------   -------------------------------   ----------------------
ftp.ebi.ac.uk   pub/databases/embl/release        EMBL flat-file format
                pub/databases/embl/release/gcg    GCG8.x standard format
Some extracts from the release notes are appended below. Fuller details are in:
ftp://ftp.ebi.ac.uk/pub/databases/embl/release/relnotes.doc
The cumulative update file pub/databases/embl/new/cumulative.dat.Z will be
changed on Thursday 10-NOV-1997 at 15:00 GMT to include only data created since
release 52.
Regards,
Peter Stoehr
EMBL-EBI
========
The EMBL Nucleotide Sequence Database was frozen to make Release 52 on the 17th
October 1997. The release contains 1,787,004 sequence entries comprising
1,181,167,498 nucleotides. This represents an increase of about 26% over
Release 51. A breakdown of Release 52 by division is shown below:
Division             Entries   Nucleotides
-----------------   --------  ------------
Bacteriophage           1351       2132530
ESTs                 1282047     472985339
Fungi                  17338      43598348
GSSs                   46472      24917886
HTG                     1300      89642076
Human                  71873     124647874
Invertebrates          27362     105437150
Organelles             23449      22009199
Other Mammals          13715      13190473
Other Vertebrates      12732      14134235
Plants                 21031      34469699
Patent                 90819      29402838
Prokaryotes            40292      90331355
Rodents                36085      44729959
STSs                   50721      17472981
Synthetic               2382       5192851
Unclassified            2540       2377434
Viruses                45495      44495271
-----------------   --------  ------------
Total                1787004    1181167498
-----------------   --------  ------------
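The totals can be cross-checked by summing the per-division rows. The following
is a minimal Python sketch with the figures copied from the table above; the
names DIVISIONS and release_totals are illustrative and not part of any EMBL
tool.

```python
# Per-division figures copied from the Release 52 breakdown:
# division name -> (entries, nucleotides)
DIVISIONS = {
    "Bacteriophage": (1351, 2132530),
    "ESTs": (1282047, 472985339),
    "Fungi": (17338, 43598348),
    "GSSs": (46472, 24917886),
    "HTG": (1300, 89642076),
    "Human": (71873, 124647874),
    "Invertebrates": (27362, 105437150),
    "Organelles": (23449, 22009199),
    "Other Mammals": (13715, 13190473),
    "Other Vertebrates": (12732, 14134235),
    "Plants": (21031, 34469699),
    "Patent": (90819, 29402838),
    "Prokaryotes": (40292, 90331355),
    "Rodents": (36085, 44729959),
    "STSs": (50721, 17472981),
    "Synthetic": (2382, 5192851),
    "Unclassified": (2540, 2377434),
    "Viruses": (45495, 44495271),
}


def release_totals(divisions):
    """Return (total entries, total nucleotides) over all divisions."""
    entries = sum(e for e, _ in divisions.values())
    nucleotides = sum(n for _, n in divisions.values())
    return entries, nucleotides


entries, nucleotides = release_totals(DIVISIONS)
print(entries, nucleotides)  # 1787004 1181167498, matching the Total row
```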
1.1 EST Database Files
In order to keep the size of the data files within reasonable limits for
handling purposes, we have split the EST division into several files. At this
release we have created two extra files of EST data named EST12.DAT and
EST13.DAT. Additional files will be added in subsequent releases as
appropriate.
1.2 Patent Sequences In PATENT.DAT
At this release we have grouped all entries derived from the patent literature
into one file PATENT.DAT. Previously we had merged such data into the other
divisions based on taxonomy, and PATENT.DAT only contained sequences which were
shorter than 15bp and which were known to be duplicates or reverse complements
of existing patent sequences.
2 FORTHCOMING CHANGES
2.1 Nucleotide And Protein Identifiers
2.1.1 Nucleotide Identifiers
The NI linetype of the EMBL flat-file format currently contains a unique
identifier for the nucleotide sequence. The value of this identifier stays the
same for as long as the sequence does. When the sequence changes, however
slightly, a new NI value is assigned, whilst the accession number on the AC
line may remain unchanged. These identifiers are maintained in collaboration
with GenBank and DDBJ, for example:
NI g21954
Feedback from users and other database groups has made it clear that the
relationship between these identifiers and the GenBank 'gi' numbers causes
confusion.
It has been decided therefore to introduce a new system of nucleotide
identifiers of the form 'accession.version', eg: X12345.3, where the accession
number part will be stable, but the version part will be incremented if the
sequence changes.
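As an illustration of how client software might handle the 'accession.version'
form, here is a minimal Python sketch. It assumes only what is stated above: a
stable accession part, a dot, and a numeric version part. The names SeqId,
parse_seq_id and sequence_changed are hypothetical, and the final format may
differ once agreed with GenBank and DDBJ.

```python
from typing import NamedTuple


class SeqId(NamedTuple):
    accession: str  # stable part, e.g. "X12345"
    version: int    # incremented when the sequence changes


def parse_seq_id(text):
    """Split an 'accession.version' string at its last dot."""
    accession, dot, version = text.strip().rpartition(".")
    if not dot or not accession or not (version.isascii() and version.isdigit()):
        raise ValueError(f"not an accession.version identifier: {text!r}")
    return SeqId(accession, int(version))


def sequence_changed(old, new):
    """True if both identifiers name the same accession and the version rose."""
    a, b = parse_seq_id(old), parse_seq_id(new)
    return a.accession == b.accession and b.version > a.version


print(parse_seq_id("X12345.3"))  # SeqId(accession='X12345', version=3)
print(sequence_changed("X12345.3", "X12345.4"))  # True
```

Splitting at the last dot keeps the sketch independent of the exact accession
number layout, which is not specified here.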
Subject to synchronisation of this change with GenBank and DDBJ, we plan to
implement this new form of nucleotide identifier at release 54, March 1998.
2.1.2 Protein Identifiers
Protein identifiers are currently assigned to all CDS features in the nucleotide
sequence database and are found in the feature table qualifier /db_xref, eg:
/db_xref="PID:e123456789"
As with the nucleotide identifiers above, users have been confused about how
these relate to the GenBank 'gi' numbers that are also assigned to CDS
features. To clarify this, and to adopt a comparable scheme of identifiers
for both nucleotides and proteins, the collaborating databases have decided to
create a new feature table qualifier /protein_id, eg:
/protein_id="AAA12345.1"
This form of identifier also allows easier tracking of changing protein
identifiers by external databases than the previous PIDs.
This qualifier consists of a stable ID portion (3+5 format: 3 letters
followed by 5 digits) plus a version number after a decimal point. The version
number will change only when the protein sequence coded by the CDS changes,
while the stable part will remain unchanged. This qualifier will be valid only
on CDS features which translate into a valid protein.
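The 3+5 layout described above lends itself to a simple pattern check. The
Python sketch below is one possible reading: it assumes the three letters are
upper case, as in the example AAA12345.1, and that the version is a plain
decimal number. The names parse_protein_id and protein_id_from_qualifier are
hypothetical.

```python
import re

# Assumed layout: 3 upper-case letters + 5 digits, a dot, then the version.
PROTEIN_ID_RE = re.compile(r"([A-Z]{3}[0-9]{5})\.([0-9]+)")

# Picks the quoted value out of a /protein_id="..." qualifier.
QUALIFIER_RE = re.compile(r'/protein_id="([^"]*)"')


def parse_protein_id(value):
    """Return (stable_id, version) or raise ValueError if the layout differs."""
    match = PROTEIN_ID_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a 3+5 protein identifier: {value!r}")
    return match.group(1), int(match.group(2))


def protein_id_from_qualifier(line):
    """Parse a feature table line; return None if it has no /protein_id."""
    match = QUALIFIER_RE.search(line)
    return parse_protein_id(match.group(1)) if match else None


print(protein_id_from_qualifier('/protein_id="AAA12345.1"'))  # ('AAA12345', 1)
```

A database tracking these identifiers could compare the stable part to
recognise the same protein record and the version to detect a sequence change.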
Subject to synchronisation of this change with GenBank and DDBJ, we plan to
implement this new form of protein identifier at release 54, March 1998.