
EMBL release 53 available

Peter Stoehr stoehr at ebi.ac.uk
Thu Jan 8 10:41:04 EST 1998


Release 53 of the EMBL Nucleotide Sequence Database is now available from the
EMBL-EBI ftp site: ftp://ftp.ebi.ac.uk/pub/databases/embl/release

This release is already installed on the other EBI query/search servers
(SRS, Blast and Fasta). A few extracts from the release notes follow; the
full notes can be read at:
 ftp://ftp.ebi.ac.uk/pub/databases/embl/release/relnotes.doc or
 http://www.ebi.ac.uk/ebi_docs/embl_db/relnotes53/relnotes.html

For those who maintain local copies using our cumulative update file,
ftp://ftp.ebi.ac.uk/pub/databases/embl/new/cumulative.dat.Z, please note that
on Friday Jan 9th at 15:00 GMT this file will be changed to include only
post-release 53 data (i.e. it will become smaller at that point).

Regards,
Peter Stoehr
EMBL-EBI
--------

1 RELEASE 53

The EMBL Nucleotide Sequence Database was frozen to make Release 53 on the  16th
December  1997.   The  release  contains  1,917,868  sequence entries comprising
1,281,391,651 nucleotides.  This represents an increase of about 8% over Release
52.  A breakdown of Release 53 by division is shown below:

                  Division             Entries     Nucleotides
                  -----------------    -------     -----------
                  Bacteriophage           1388         2188305
                  ESTs                 1343796       496603984
                  Fungi                  18137        44602064
                  GSSs                  100154        49099107
                  HTG                     1868       102763872
                  Human                  74384       139022655
                  Invertebrates          28126       107524431
                  Organelles             24715        22870076
                  Other Mammals          14429        13785092
                  Other Vertebrates      13145        14653255
                  Plants                 22136        37736590
                  Patent                 91221        29511807
                  Prokaryotes            42666       102750354
                  Rodents                37043        46489741
                  STSs                   51172        17685717
                  Synthetic               2424         5377292
                  Unclassified            2380         2387088
                  Viruses                48684        46340221
                  -----------------   --------    ------------
                  Total                1917868      1281391651
                  -----------------   --------    ------------


1.1  EST Database Files

In order to keep the size  of  the  data  files  within  reasonable  limits  for
handling  purposes,  we have split the EST division into several files.  At this
release we have created one extra file of EST data, named EST14.DAT.  Additional
files will be added in subsequent releases as appropriate.


1.2  Feature Table Qualifiers

1.2.1  New SOURCE Qualifier /specimen_voucher

This new source feature qualifier, valid at this release, indicates the source
of a sample of the sequenced material, e.g. a museum identification tag.
Qualifier       /specimen_voucher="text"
Definition      an identifier of the individual or collection of the source
                organism and the place where it is currently stored, usually
                an institution.
Value format    "text"
Example         /specimen_voucher="Smith s. n. 4-IV-1995 (U. S. Natl.
                Herbarium)"


1.2.2  New /focus

This new source feature qualifier, valid at this release, defines the main
source feature for records with more than one source feature (e.g.
proviral/cellular sequences).
Qualifier       /focus
Definition      defines the preferred source feature for records that
                have more than one source feature
Value format    none
Example         /focus
Comment         this qualifier is to be used only if there is more than
                one source feature. The preferred source feature is
                used to determine which organism is displayed in
                the SOURCE and ORGANISM lines and to determine the
                EMBL division in which it is placed.

For sequences derived from more than one organism, and therefore containing more
than  one  'source'  feature  key,  the /focus qualifier will be attached to the
source key which represents the major organism, that which was the focus of  the
sequencing  effort.   If  no  translation  table is specified, the organism with
/focus will define the translation table.
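
The selection rule above can be sketched in a few lines of Python. This is
only an illustration, not EBI code: the function name preferred_source and
the representation of each source feature as a dictionary of qualifiers are
hypothetical, and the sketch assumes that exactly one source feature carries
/focus whenever a record has more than one source feature.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical data model: one mapping of qualifier name -> value per source
# feature. A value-less qualifier such as /focus is stored with value None.
Qualifiers = Mapping[str, Optional[str]]


def preferred_source(sources: Sequence[Qualifiers]) -> Qualifiers:
    """Return the source feature that determines organism and division."""
    if not sources:
        raise ValueError("record has no source feature")
    if len(sources) == 1:
        # /focus is only used when there is more than one source feature.
        return sources[0]
    focused = [q for q in sources if "focus" in q]
    # Assumption: a multi-source record marks exactly one feature with /focus.
    if len(focused) != 1:
        raise ValueError(
            f"expected exactly one /focus qualifier, found {len(focused)}"
        )
    return focused[0]
```

For a record with two source features, the one whose qualifiers include
"focus" is returned; a record with a single source feature needs no /focus.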


2  FORTHCOMING CHANGES

2.1  Nucleotide And Protein Identifiers

2.1.1  Nucleotide Identifiers

The NI linetype of  the  EMBL  flat-file  format  currently  contains  a  unique
identifier for the nucleotide sequence.  While the sequence remains the same, so
does the value of this identifier.   When  a  sequence  change  occurs,  however
minor,  a  new  NI  value will be assigned whilst the accession number on the AC
line may remain unchanged.  These  identifiers  are  collaboratively  maintained
with GenBank and DDBJ, for example:

     NI   g21954

Feedback from users and other database groups has made it clear that there is
confusion about the relationship between these identifiers and the GenBank 'gi'
numbers.

It  has  been  decided  therefore  to  introduce  a  new  system  of  nucleotide
identifiers  of the form 'accession.version', eg:  X12345.3, where the accession
number part will be stable, but the version part  will  be  incremented  if  the
sequence changes.
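
As a rough illustration of how software might handle the planned
'accession.version' form, the following Python sketch splits an identifier
into its two parts and checks whether one identifier is a later version of
the same record. The names SeqVersion, parse_seq_version and
is_newer_version are hypothetical, and the pattern assumes an alphanumeric
accession followed by an integer version; the final syntax may differ.

```python
import re
from typing import NamedTuple


class SeqVersion(NamedTuple):
    accession: str  # stable part, e.g. "X12345"
    version: int    # incremented when the sequence changes


# Assumption: alphanumeric accession, a dot, then an integer version.
_SEQ_VERSION_RE = re.compile(r"^([A-Za-z0-9_]+)\.(\d+)$")


def parse_seq_version(identifier: str) -> SeqVersion:
    """Split an 'accession.version' identifier into its two parts."""
    match = _SEQ_VERSION_RE.match(identifier)
    if match is None:
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return SeqVersion(match.group(1), int(match.group(2)))


def is_newer_version(old: str, new: str) -> bool:
    """True if both identifiers share an accession and 'new' has a higher version."""
    a, b = parse_seq_version(old), parse_seq_version(new)
    return a.accession == b.accession and b.version > a.version
```

With these definitions, parse_seq_version("X12345.3") yields the accession
"X12345" and the version 3.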

Subject to synchronisation of this change with GenBank  and  DDBJ,  we  plan  to
implement this new form of nucleotide identifier during 1998.



2.1.2  Protein Identifiers

Protein identifiers are currently assigned to all CDS features in the nucleotide
sequence database and are found in the feature table qualifier /db_xref, eg:


     /db_xref="PID:e123456789"

As for nucleotide identifier values (above), confusion  resulted  amongst  users
concerning  the  relationship  of these to GenBank 'gi' numbers also assigned to
CDS features.  To clarify this, and to adopt a comparable scheme of  identifiers
for  both  nucleotides and proteins, the collaborating databases have decided to
create a new feature table qualifier /protein_id, eg:

     /protein_id="AAA12345.1"

This form  of  identifier  also  allows  easier  tracking  of  changing  protein
identifiers by external databases than the previous PIDs.


This qualifier consists of a stable ID portion (3+5 format: 3 letters followed
by 5 digits) plus a version number after a decimal point.  The version
number will change only when the protein sequence coded by the CDS changes,
while the stable part will remain unchanged.  This qualifier will be valid only
on CDS features which translate into a valid protein.
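
The 3+5 format can be checked mechanically. The Python sketch below is a
minimal example under stated assumptions: the name split_protein_id is
hypothetical, and the pattern assumes upper-case letters (as in the example
AAA12345.1) and an integer version number.

```python
import re
from typing import Tuple

# Assumption: 3 upper-case letters, 5 digits, a dot, then an integer version.
_PROTEIN_ID_RE = re.compile(r"^([A-Z]{3}\d{5})\.(\d+)$")


def split_protein_id(value: str) -> Tuple[str, int]:
    """Return (stable_id, version) for a /protein_id value such as 'AAA12345.1'."""
    match = _PROTEIN_ID_RE.match(value)
    if match is None:
        raise ValueError(f"not a 3+5 protein identifier: {value!r}")
    return match.group(1), int(match.group(2))
```

An old-style PID value such as "e123456789" does not match this pattern and
is rejected with a ValueError.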

Subject to synchronisation of this change with GenBank  and  DDBJ,  we  plan  to
implement this new form of protein identifier during 1998.


2.2  Feature Table Qualifiers

2.2.1  New /protein_id

As mentioned above, a new feature table qualifier /protein_id will be created:
Qualifier       /protein_id="<identifier>"
Definition      Protein Identifier, issued by International collaborators.
                This qualifier consists of a stable ID portion (3+5 format:
                3 letters followed by 5 digits) plus a version number
                after the decimal point.
Example         /protein_id="AAA12345.1"
Comment         The version number will change only when the protein sequence
                coded by the CDS changes, while the stable part will remain
                unchanged. This qualifier is valid only on CDS features which
                translate into a valid protein. The list of 3-letter prefixes
                will be maintained by EBI.


Subject to synchronisation of this change with GenBank  and  DDBJ,  we  plan  to
implement this new form of protein identifier during 1998.


2.2.2  /translation And Related Feature Qualifiers

The collaborating databases DDBJ/EMBL/GenBank have decided that
translation-related qualifiers should only be used with the primary CDS
feature key.

These translation-related qualifiers are:

  /codon
  /codon_start
  /exception
  /translation
  /transl_table
  /transl_except


Starting at release 54, translation-related qualifiers will be valid only with
the CDS feature key and will be removed from the following non-CDS features:

  C_region
  D_segment
  exon
  J_segment
  mat_peptide
  N_region
  sig_peptide
  S_region
  transit_peptide
  V_region
  V_segment
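
For those who post-process local copies, the effect of this change can be
sketched in Python as follows. This is an illustration only: the function
name strip_translation_qualifiers and the representation of qualifiers as a
dictionary keyed by qualifier name (without the leading slash) are
hypothetical, while the two sets simply restate the lists above.

```python
from typing import Dict, Optional

# The translation-related qualifiers listed above.
TRANSLATION_QUALIFIERS = frozenset({
    "codon", "codon_start", "exception",
    "translation", "transl_table", "transl_except",
})

# The non-CDS feature keys from which they will be removed.
AFFECTED_FEATURE_KEYS = frozenset({
    "C_region", "D_segment", "exon", "J_segment", "mat_peptide", "N_region",
    "sig_peptide", "S_region", "transit_peptide", "V_region", "V_segment",
})


def strip_translation_qualifiers(
    feature_key: str, qualifiers: Dict[str, Optional[str]]
) -> Dict[str, Optional[str]]:
    """Return a copy of the qualifiers as they would look from release 54."""
    if feature_key not in AFFECTED_FEATURE_KEYS:
        # CDS (and any key not listed above) is left unchanged.
        return dict(qualifiers)
    return {
        name: value
        for name, value in qualifiers.items()
        if name not in TRANSLATION_QUALIFIERS
    }
```

Applied to an exon feature carrying /codon_start, the sketch drops that
qualifier and keeps the rest; the same qualifier on a CDS feature is kept.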


