EMBL Release 46 available

Peter Stoehr stoehr at ebi.ac.uk
Thu Mar 14 13:17:01 EST 1996

Release 46 (March 1996) of the EMBL Nucleotide Sequence Database is ready
and available via the EBI anonymous FTP, email, and FASTA network servers.

As a result of the new release, on Monday March 18 the cumulative update file
ftp://ftp.ebi.ac.uk/pub/databases/embl/new/cumulative.dat.Z will become smaller
and contain only data created after Release 46 was built (28 February 1996).

Below are some extracts from the release notes. The fuller version is
available in ftp://ftp.ebi.ac.uk/pub/databases/embl/release/relnotes.doc

For those installing the new release into software systems such as GCG or SRS,
please note that there is an extra file EST5.DAT to deal with.

Peter Stoehr

----- EXCERPT FROM RELEASE NOTES ----------------------------------------------


The EMBL Nucleotide Sequence Database was frozen to make Release 46 on the  28th
February  1996.   The  release  contains  701,246  sequence  entries  comprising
473,691,480 nucleotides.  This represents an increase of about 11% over  Release
45.  A breakdown of Release 46 by taxonomic division is shown below:

                  Division             Entries    Nucleotides
                  -----------------    -------    ------------
                  Bacteriophage           1155         1709794
                  ESTs                  435315       151738729
                  Fungi                  11202        29461528
                  Invertebrates          17455        52142164
                  Organelles             12692        13268750
                  Other Mammals           7724         8325527
                  Other Vertebrates       9019        10225112
                  Plants                 14411        18129030
                  Primates               55763        52673518
                  Prokaryotes            28771        51378771
                  Rodents                28678        32367952
                  STSs                   25900         8584341
                  Synthetic              11290         5473089
                  Unclassified           12226         5776683
                  Viruses                29645        32436492
                  -----------------    -------    ------------
                  Total                 701246       473691480
                  Other patents           3165          348946
                  -----------------    -------    ------------
                  Grand Total           704411       474040426

1.1  Database Cross-references

At the previous release we introduced a new feature table  qualifier  "/db_xref"
to  represent  cross-references to external databases.  This qualifier is valid,
but  optional,  for  all  feature  keys.   There  are  two  components  to   the
cross-reference  value,  the name of the database and the identifier within that
database being referenced, formatted as follows:


In this release, we have included cross-references using the "/db_xref"  on  CDS
features to the FLYBASE Drosophila database.

A cross-reference from a CDS feature to the database  "FLYBASE"  indicates  that
this  feature  corresponds  to the entity (eg gene name) in the FLYBASE database
with the given identifier, eg.
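
As an illustration only, the two-part value described above (database name plus
identifier) could be split with a few lines of Python. The layout
/db_xref="DATABASE:IDENTIFIER", the function name parse_db_xref, and the
identifier ID0001 are assumptions made for this sketch; they are not taken from
the release notes.

```python
import re

# Assumed layout (not quoted from the release notes):
#   /db_xref="DATABASE:IDENTIFIER"
DB_XREF = re.compile(r'/db_xref="([^:"]+):([^"]+)"')


def parse_db_xref(qualifier):
    """Return (database, identifier), or None if the text is not a /db_xref."""
    match = DB_XREF.fullmatch(qualifier.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    # ID0001 is a placeholder, not a real FLYBASE identifier.
    print(parse_db_xref('/db_xref="FLYBASE:ID0001"'))
```

The pattern splits on the first colon, so an identifier that itself contains a
colon would still be returned whole.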


1.2  EST Database Files

In order to keep the size  of  the  data  files  within  reasonable  limits  for
handling  purposes,  we have split the EST division into several files.  At this
release we have created a fifth file of EST  data  named  EST5.DAT.   Additional
files will be added in subsequent releases as appropriate.


2.1  Separation Of Human Sequence Data

At the next release (Release 47,  June  1996)  we  intend  to  introduce  a  new
database division HUM for human (non-EST/STS/organelle) data.  The primate (PRI)
division will be withdrawn and the small remainder of the primate sequence  data
(circa 3000 sequences) will be merged with the other mammal (MAM) division.

We expect the volume of human data to increase  dramatically  in  the  next  few
years.   In order to keep database files at a manageable size, we will split the
HUM division into several database files, named  HUM1.DAT,  HUM2.DAT,  etc.   We
will add further database files in subsequent releases as appropriate.

2.2  *IMPORTANT* Notice Of Accession Number Format Change

Nucleotide Sequence Database Collaborative Agreement, 31 May 1995

Currently, accession numbers used by the nucleotide sequence  databases  consist
of  one  prefix  letter  followed by 5 digits.  EST projects and projects to add
patent data have accelerated the need to extend the accession number space.   It
is projected that the databases will run out of accession numbers within 8 to 10

It is clear that:

* As much notice as possible should be given to users and software developers
* The change should make a large enough space that another change will not
  be necessary in the foreseeable future.
* The accession number should continue to be readily identifiable as a
  DDBJ/EMBL/GenBank accession number.

The collaborators concluded that:

* A new form of accession number will be created, defined as an
  8-character alphanumeric string, beginning with two upper case
  letters and followed only by digits (e.g., SR004562).  Leading and
  trailing zeros are significant.  The letter 'O' will not be used.

* Existing 6-character accession numbers will remain as they are, and will
  never be transformed to an 8-character form.

* New accession numbers will not be used before February 1, 1996. The groups
  agree to avoid using new accession numbers as long as possible after that.
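
For software developers preparing for the change, the two formats described
above might be recognised along the following lines. This is a minimal sketch,
not an official validator: the function name classify_accession is
hypothetical, the existing prefix letter is assumed to be upper case, and the
exclusion of the letter 'O' is applied only to the new 8-character form, as
stated in the agreement.

```python
import re

# Existing form: one prefix letter followed by 5 digits.
# Assumption: the prefix letter is upper case.
OLD_FORMAT = re.compile(r"[A-Z][0-9]{5}")

# New form: two upper-case letters followed by 6 digits (8 characters),
# with the letter 'O' never used.
NEW_FORMAT = re.compile(r"[A-NP-Z]{2}[0-9]{6}")


def classify_accession(accession):
    """Return 'old', 'new', or 'invalid' for an accession number string."""
    if OLD_FORMAT.fullmatch(accession):
        return "old"
    if NEW_FORMAT.fullmatch(accession):
        return "new"
    return "invalid"


if __name__ == "__main__":
    for acc in ("U35111", "SR004562", "SO004562", "SR4562"):
        print(acc, classify_accession(acc))
```

Accession numbers are kept as strings throughout, because leading and trailing
zeros are significant and would be lost if the digits were read as an integer.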

The International Nucleotide Sequence Databases

2.3  New Nucleic Acid Identifier (NI) Line

We intend to introduce a new line type NI to  contain  an  identifier  for  each
nucleic  acid  sequence.  While the sequence remains the same, so will the value
of this identifier.  When a sequence change occurs,  however  minor,  a  new  NI
value  will  be  assigned  whilst the accession number on the AC line may remain
unchanged.  These NI values are analogous to those to be represented in the  NID
lines  of  GenBank  entries,  and we will inherit GenBank NID values into our NI
lines.  Starting at release 47 (June 1996), each entry will have an NI  line  of
the form:

   AC   U35111;
   NI   g1006834
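
Software that reads entries line by line could pick up the new line type
alongside the AC line roughly as follows. This is a sketch under assumptions:
the function name parse_ac_ni is hypothetical, each line is taken to begin with
a two-letter line code followed by white space, and AC lines are taken to hold
semicolon-terminated accession numbers as in the example above.

```python
def parse_ac_ni(entry_text):
    """Return (list of accession numbers, NI value or None) for one entry."""
    accessions = []
    nucleic_id = None
    for raw_line in entry_text.splitlines():
        parts = raw_line.split(None, 1)  # line code, then the rest
        if len(parts) < 2:
            continue  # blank line, or a line code with no data
        code, data = parts[0], parts[1].strip()
        if code == "AC":
            # Accession numbers are separated and terminated by ';'.
            accessions.extend(a.strip() for a in data.split(";") if a.strip())
        elif code == "NI":
            nucleic_id = data
    return accessions, nucleic_id


if __name__ == "__main__":
    example = "   AC   U35111;\n   NI   g1006834\n"
    print(parse_ac_ni(example))
```

Because the NI value changes with any sequence change while the accession
number may not, a program that caches sequences might compare the NI value
rather than the accession number to detect a revised sequence.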

2.4  Taxonomy

The NCBI have recently put significant effort into a project 'Taxon' to create a
taxonomy  database  which  reflects  current  phylogenetic  knowledge.   It is a
sequence-based taxonomy as much as possible, and based on published  authorities
wherever  possible.   Taxon  is  being  maintained  by three NCBI scientists and
curated by a panel of established evolutionary molecular biologists.

We will incorporate this taxonomy into our database at an  opportune  moment  in
the  coming  months, when a few operational details are resolved.  At that time,
the OC lines of all entries will reflect the revised taxonomic classification.

----- END OF EXCERPT ----------------------------------------------------------

