EMBL Release 46 available
Peter Stoehr
stoehr at ebi.ac.uk
Thu Mar 14 13:17:01 EST 1996
Release 46 (March 1996) of the EMBL Nucleotide Sequence Database is ready
and available via anonymous FTP, email, and FASTA EBI network servers.
As a result of the new release, on Monday March 18 the cumulative update file
ftp://ftp.ebi.ac.uk/pub/databases/embl/new/cumulative.dat.Z will become smaller
and only contain data created after release 46 was built (28.2.96).
Below are some extracts from the release notes. The fuller version is
available in ftp://ftp.ebi.ac.uk/pub/databases/embl/release/relnotes.doc
For those installing the new release into software systems such as GCG or SRS,
please note that there is an extra file EST5.DAT to deal with.
Regards,
Peter Stoehr
EMBL - EBI
----- EXCERPT FROM RELEASE NOTES ----------------------------------------------
1 RELEASE 46
The EMBL Nucleotide Sequence Database was frozen to make Release 46 on the 28th
February 1996. The release contains 701,246 sequence entries comprising
473,691,480 nucleotides. This represents an increase of about 11% over Release
45. A breakdown of Release 46 by taxonomic division is shown below:
Division           Entries   Nucleotides
-----------------  -------  ------------
Bacteriophage         1155       1709794
ESTs                435315     151738729
Fungi                11202      29461528
Invertebrates        17455      52142164
Organelles           12692      13268750
Other Mammals         7724       8325527
Other Vertebrates     9019      10225112
Plants               14411      18129030
Primates             55763      52673518
Prokaryotes          28771      51378771
Rodents              28678      32367952
STSs                 25900       8584341
Synthetic            11290       5473089
Unclassified         12226       5776683
Viruses              29645      32436492
-----------------  -------  ------------
Total               701246     473691480
plus:
Other patents         3165        348946
-----------------  -------  ------------
Grand Total         704411     474040426
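The totals in the table above can be re-derived from the per-division rows. The following is a minimal sketch of that arithmetic; the variable names are illustrative only and are not part of any EMBL tool.

```python
# (division, entries, nucleotides), copied from the Release 46 table.
DIVISIONS = [
    ("Bacteriophage", 1155, 1709794),
    ("ESTs", 435315, 151738729),
    ("Fungi", 11202, 29461528),
    ("Invertebrates", 17455, 52142164),
    ("Organelles", 12692, 13268750),
    ("Other Mammals", 7724, 8325527),
    ("Other Vertebrates", 9019, 10225112),
    ("Plants", 14411, 18129030),
    ("Primates", 55763, 52673518),
    ("Prokaryotes", 28771, 51378771),
    ("Rodents", 28678, 32367952),
    ("STSs", 25900, 8584341),
    ("Synthetic", 11290, 5473089),
    ("Unclassified", 12226, 5776683),
    ("Viruses", 29645, 32436492),
]
OTHER_PATENTS = (3165, 348946)  # the "Other patents" row: (entries, nucleotides)

# Sum the division rows, then add the patent row for the grand total.
total_entries = sum(entries for _, entries, _ in DIVISIONS)
total_nucleotides = sum(nucleotides for _, _, nucleotides in DIVISIONS)
grand_entries = total_entries + OTHER_PATENTS[0]
grand_nucleotides = total_nucleotides + OTHER_PATENTS[1]

print(total_entries, total_nucleotides)  # 701246 473691480
print(grand_entries, grand_nucleotides)  # 704411 474040426
```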
1.1 Database Cross-references
At the previous release we introduced a new feature table qualifier "/db_xref"
to represent cross-references to external databases. This qualifier is valid,
but optional, for all feature keys. There are two components to the
cross-reference value, the name of the database and the identifier within that
database being referenced, formatted as follows:
/db_xref="database:identifier"
In this release, we have included cross-references using the "/db_xref" on CDS
features to the FLYBASE Drosophila database.
A cross-reference from a CDS feature to the database "FLYBASE" indicates that
this feature corresponds to the entity (e.g. gene name) in the FLYBASE database
with the given identifier, e.g.
/db_xref="FLYBASE:FBgn0012052"
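For software that reads the feature table, the qualifier can be split into its two components with a simple pattern. This is a minimal sketch only: the function name parse_db_xref is hypothetical, and the pattern assumes that the database name contains no colon, so the first colon separates the two components.

```python
import re

# Matches /db_xref="database:identifier". Assumption: the database name holds
# no colon or double quote; the identifier is everything up to the closing quote.
DB_XREF = re.compile(r'/db_xref="([^":]+):([^"]+)"')


def parse_db_xref(line):
    """Return (database, identifier), or None if the line has no /db_xref."""
    match = DB_XREF.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_db_xref('/db_xref="FLYBASE:FBgn0012052"'))
```

Because the qualifier is optional for all feature keys, callers should expect None for most feature lines.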
1.2 EST Database Files
In order to keep the size of the data files within reasonable limits for
handling purposes, we have split the EST division into several files. At this
release we have created a fifth file of EST data named EST5.DAT. Additional
files will be added in subsequent releases as appropriate.
2 FORTHCOMING CHANGES
2.1 Separation Of Human Sequence Data
At the next release (Release 47, June 1996) we intend to introduce a new
database division HUM for human (non-EST/STS/organelle) data. The primate (PRI)
division will be withdrawn and the small remainder of the primate sequence data
(circa 3000 sequences) will be merged with the other mammal (MAM) division.
We expect the volume of human data to increase dramatically in the next few
years. In order to keep database files at a manageable size, we will split the
HUM division into several database files, named HUM1.DAT, HUM2.DAT, etc. We
will add further database files in subsequent releases as appropriate.
2.2 *IMPORTANT* Notice Of Accession Number Format Change
Nucleotide Sequence Database Collaborative Agreement, 31 May 1995
Currently, accession numbers used by the nucleotide sequence databases consist
of one prefix letter followed by 5 digits. EST projects and projects to add
patent data have accelerated the need to extend the accession number space. It
is projected that the databases will run out of accession numbers within 8 to 10
months.
It is clear that:
* As much notice as possible should be given to users and software developers
* The change should make a large enough space that another change will not
be necessary in the foreseeable future.
* The accession number should continue to be readily identifiable as a
DDBJ/EMBL/GenBank accession number.
The collaborators concluded that:
* A new form of accession number will be created, defined as an
8-character alphanumeric string, beginning with two upper case
letters and followed only by digits (e.g., SR004562). Leading and
trailing zeros are significant. The letter 'O' will not be used.
* Existing 6-character accession numbers will remain as they are, and will
never be transformed to an 8-character form.
* New accession numbers will not be used before February 1, 1996. The groups
agree to avoid using new accession numbers as long as possible after that.
The International Nucleotide Sequence Databases
DDBJ/EMBL/GenBank
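
Software developers preparing for the change may want to accept both accession number forms. The sketch below shows one way to do so; the function name classify_accession is hypothetical. It assumes that the prefix letter of the existing form is upper case (as in U35111), and it applies the exclusion of the letter 'O' only to the new form, since the agreement does not say whether it applies to the old one.

```python
import re

# Existing form: one prefix letter followed by 5 digits (upper case assumed).
OLD_FORMAT = re.compile(r"[A-Z][0-9]{5}")
# New form: 8 characters, two upper-case letters (never 'O') then 6 digits.
NEW_FORMAT = re.compile(r"[A-NP-Z]{2}[0-9]{6}")


def classify_accession(accession):
    """Return "old", "new", or None for a string that fits neither form."""
    if OLD_FORMAT.fullmatch(accession):
        return "old"
    if NEW_FORMAT.fullmatch(accession):
        return "new"
    return None


for candidate in ["U35111", "SR004562", "SO004562", "SR4562"]:
    print(candidate, classify_accession(candidate))
```

Accession numbers should be kept as strings throughout, never converted to integers, because leading and trailing zeros are significant.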
2.3 New Nucleic Acid Identifier (NI) Line
We intend to introduce a new line type NI to contain an identifier for each
nucleic acid sequence. While the sequence remains the same, so will the value
of this identifier. When a sequence change occurs, however minor, a new NI
value will be assigned whilst the accession number on the AC line may remain
unchanged. These NI values are analogous to those to be represented in the NID
lines of GenBank entries, and we will inherit GenBank NID values into our NI
lines. Starting at release 47 (June 1996), each entry will have an NI line of
the form:
AC U35111;
XX
NI g1006834
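
A parser could pick up the new line type alongside the AC line as sketched below. This is an illustration under assumptions, not a definitive reader: the function name read_ac_and_ni is hypothetical, and the code assumes that each line starts with a two-character line code and that accession numbers on the AC line are terminated by semicolons.

```python
def read_ac_and_ni(lines):
    """Collect accession numbers (AC lines) and the NI value from one entry."""
    accessions = []
    ni_value = None
    for line in lines:
        # Assumption: a two-character line code, then whitespace, then data.
        code, data = line[:2], line[2:].strip()
        if code == "AC":
            accessions.extend(part.strip() for part in data.split(";") if part.strip())
        elif code == "NI":
            ni_value = data
    return accessions, ni_value


entry = ["AC U35111;", "XX", "NI g1006834"]
print(read_ac_and_ni(entry))
```

Comparing the NI value of an entry between two releases would then show whether its sequence has changed, even when the accession number has stayed the same.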
2.4 Taxonomy
The NCBI have recently put significant effort into a project 'Taxon' to create a
taxonomy database which reflects current phylogenetic knowledge. It is a
sequence-based taxonomy as much as possible, and based on published authorities
wherever possible. Taxon is being maintained by three NCBI scientists and
curated by a panel of established evolutionary molecular biologists.
We will incorporate this taxonomy into our database at an opportune moment in
the coming months, when a few operational details are resolved. At that time,
the OC lines of all entries will reflect the revised taxonomic classification.
----- END OF EXCERPT---------------------------------------------------------