EMBL Release 52 available

Peter Stoehr stoehr at ebi.ac.uk
Wed Nov 12 13:29:40 EST 1997

Release 52 of the EMBL Nucleotide Sequence Database is available from the
EBI FTP server.

FTP server        Directory                        Format
-----------       ---------                        ------
ftp.ebi.ac.uk     pub/databases/embl/release       EMBL flat-file format
                  pub/databases/embl/release/gcg   GCG8.x standard format

Some extracts from the release notes are appended below. Fuller details in:

The cumulative update file pub/databases/embl/new/cumulative.dat.Z will be
changed on Thursday 10-NOV-1997 15:00 GMT to include only data created since
release 52.

Peter Stoehr

The EMBL Nucleotide Sequence Database was frozen to make Release 52 on the  17th
October  1997.   The  release  contains  1,787,004  sequence  entries comprising
1,181,167,498 nucleotides.  This  represents  an  increase  of  about  26%  over
Release 51.  A breakdown of Release 52 by division is shown below:

                  Division             Entries     Nucleotides
                  -----------------    -------     -----------
                  Bacteriophage           1351         2132530
                  ESTs                 1282047       472985339
                  Fungi                  17338        43598348
                  GSSs                   46472        24917886
                  HTG                     1300        89642076
                  Human                  71873       124647874
                  Invertebrates          27362       105437150
                  Organelles             23449        22009199
                  Other Mammals          13715        13190473
                  Other Vertebrates      12732        14134235
                  Plants                 21031        34469699
                  Patent                 90819        29402838
                  Prokaryotes            40292        90331355
                  Rodents                36085        44729959
                  STSs                   50721        17472981
                  Synthetic               2382         5192851
                  Unclassified            2540         2377434
                  Viruses                45495        44495271
                  -----------------   --------    ------------
                  Total                1787004      1181167498
                  -----------------   --------    ------------

1.1  EST Database Files

In order to keep the size  of  the  data  files  within  reasonable  limits  for
handling  purposes,  we have split the EST division into several files.  At this
release we have created  two  extra  files  of  EST  data  named  EST12.DAT  and
EST13.DAT.    Additional   files   will  be  added  in  subsequent  releases  as

1.2  Patent Sequences In PATENT.DAT

At this release we have grouped all entries derived from the  patent  literature
into  one  file  PATENT.DAT.   Previously we had merged such data into the other
divisions based on taxonomy, and PATENT.DAT only contained sequences which  were
shorter  than  15bp and which were known to be duplicates or reverse complements
of existing patent sequences.


2.1  Nucleotide And Protein Identifiers

2.1.1  Nucleotide Identifiers

The NI linetype of  the  EMBL  flat-file  format  currently  contains  a  unique
identifier for the nucleotide sequence.  While the sequence remains the same, so
does the value of this identifier.   When  a  sequence  change  occurs,  however
minor,  a  new  NI  value will be assigned whilst the accession number on the AC
line may remain unchanged.  These  identifiers  are  collaboratively  maintained
with GenBank and DDBJ, for example:

     NI   g21954

It has become clear from users and other database groups that confusion has been
created  about  the  relationship between these identifiers and the GenBank 'gi'

It  has  been  decided  therefore  to  introduce  a  new  system  of  nucleotide
identifiers  of the form 'accession.version', eg:  X12345.3, where the accession
number part will be stable, but the version part  will  be  incremented  if  the
sequence changes.
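
As an illustration only, the 'accession.version' form described above could be
split into its stable and changing parts as in the Python sketch below.  The
names SequenceId, parse_sequence_id and is_new_version_of are hypothetical and
are not part of any EMBL software; the sketch assumes the version is a plain
decimal integer following the last dot.

```python
from typing import NamedTuple


class SequenceId(NamedTuple):
    """Hypothetical container for an 'accession.version' identifier."""
    accession: str  # stable part, e.g. "X12345"
    version: int    # incremented when the sequence changes


def parse_sequence_id(text: str) -> SequenceId:
    """Split an identifier such as 'X12345.3' at its last dot."""
    accession, sep, version = text.strip().rpartition(".")
    # Assumption: the version is written as ASCII decimal digits.
    if not sep or not accession or not (version.isascii() and version.isdigit()):
        raise ValueError(f"not an accession.version identifier: {text!r}")
    return SequenceId(accession, int(version))


def is_new_version_of(newer: str, older: str) -> bool:
    """True if both identifiers share an accession and 'newer' has a higher version."""
    a, b = parse_sequence_id(newer), parse_sequence_id(older)
    return a.accession == b.accession and a.version > b.version
```

With this sketch, parse_sequence_id("X12345.3") yields the accession "X12345"
and the version 3, so an external database could detect a sequence change by
comparing versions while continuing to key its records on the accession.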

Subject to synchronisation of this change with GenBank  and  DDBJ,  we  plan  to
implement this new form of nucleotide identifier at release 54, March 1998.

2.1.2  Protein Identifiers

Protein identifiers are currently assigned to all CDS features in the nucleotide
sequence database and are found in the feature table qualifier /db_xref, eg:


As with the nucleotide identifiers above, users have been confused about the
relationship of these to the GenBank 'gi' numbers that are also assigned to
CDS features.  To clarify this, and to adopt a comparable scheme of  identifiers
for  both  nucleotides and proteins, the collaborating databases have decided to
create a new feature table qualifier /protein_id, eg:


This form  of  identifier  also  allows  easier  tracking  of  changing  protein
identifiers by external databases than the previous PIDs.

This qualifier consists of a stable ID portion (3+5 format: 3 letters followed
by 5 digits) plus a version number after a decimal point.  The version
number will change only when the protein sequence  coded  by  the  CDS  changes,
while  the stable part will remain unchanged.  This qualifier will be valid only
on CDS features which translate into a valid protein.
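
The 3+5 format plus version described above can be checked mechanically.  The
Python sketch below is one possible reading of that description, not an
official validator: it assumes the letters are upper-case A-Z, and both the
pattern name PROTEIN_ID_RE and the function parse_protein_id are hypothetical.
The value "AAA12345.1" used in the comment is a made-up example, not a real
identifier.

```python
import re

# Assumption: 3 upper-case letters, 5 digits, a dot, then a decimal version.
PROTEIN_ID_RE = re.compile(r"([A-Z]{3}[0-9]{5})\.([0-9]+)")


def parse_protein_id(value: str) -> tuple[str, int]:
    """Return (stable_id, version) for a /protein_id value in 3+5.version form."""
    match = PROTEIN_ID_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a 3+5.version protein identifier: {value!r}")
    stable_id, version = match.group(1), int(match.group(2))
    return stable_id, version


# Made-up value for illustration: stable part "AAA12345", version 1.
stable_id, version = parse_protein_id("AAA12345.1")
```

Because only the version changes when the coded protein changes, a database
that stores the stable part can follow a protein across revisions by
re-reading the version.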

Subject to synchronisation of this change with GenBank  and  DDBJ,  we  plan  to
implement this new form of protein identifier at release 54, March 1998.
