GenBank Release 135.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Tue Apr 15 19:33:44 EST 2003


Greetings GenBank Users,

  GenBank Release 135.0 is now available via ftp from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 135.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 135.0

  Uncompressed, the Release 135.0 flatfiles require approximately 100 GB
(sequence files only) or 114 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires approximately 89.67 GB. From the
release notes:

   Release  Date       Base Pairs   Entries

   134      Feb 2003   29358082791  23035823
   135      Apr 2003   31099264455  24027936

  Close-of-data was 04/09/2003. Four working days were required to prepare
this release. In the eight-week period between the close dates for GenBank
releases 134.0 and 135.0, GenBank grew by 1,741,181,664 basepairs and by
992,113 sequence records. During that same period, 319,884 records were
updated. Combined, this yields an average of about 23,400 new/updated
records per day.
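
  The growth figures above can be rechecked from the release table. The
following is only a sketch of that arithmetic: the variable names are
illustrative, and treating the "eight-week period" as exactly 56 days is an
assumption.

```python
# Totals copied from the release table in this announcement.
bases_134, entries_134 = 29_358_082_791, 23_035_823
bases_135, entries_135 = 31_099_264_455, 24_027_936
updated_records = 319_884

# Assumption: the eight-week period is counted as exactly 56 days.
days = 8 * 7

new_bases = bases_135 - bases_134
new_records = entries_135 - entries_134
per_day = (new_records + updated_records) / days

print(f"{new_bases:,} basepairs added")
print(f"{new_records:,} sequence records added")
print(f"{per_day:,.0f} new/updated records per day")
```

  Under that 56-day assumption the daily figure rounds to the "about 23,400"
quoted above.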

  Since the close-date for this release nearly coincided with the April 14th
announcement of the completion of the Human Genome Project, it seems 
appropriate to report some statistics for human sequences as well:

	Basepairs of human sequence :      9,743,398,611
	Number of human sequence records : 6,574,171

Note that these figures include human EST, GSS, HTC, HTG, MGC, PRI, and
STS sequences.

  We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank .
Those who experience slow FTP transfers of GenBank releases might realize
an improvement in transfer rates from these alternate sites when traffic
at the NCBI is heavy.

  For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 135.0 and Upcoming Changes) have been appended below.

  Release 135.0 data, and subsequent updates, are available now via NCBI's
Entrez and Blast services.

  If you encounter problems while ftp'ing or uncompressing Release 135.0,
please send email outlining your difficulties to info at ncbi.nlm.nih.gov .

Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev, Michael Kimelman
GenBank
NCBI/NLM/NIH


1.3 Important Changes in Release 135.0

1.3.1 Organizational changes

  The total number of sequence data files increased by 17 with this release:

  - the BCT division is now comprised of 7 files   (+1)
  - the EST division is now comprised of 244 files (+4)
  - the GSS division is now comprised of 70 files  (+4)
  - the HTG division is now comprised of 62 files  (+4)
  - the INV division is now comprised of 6 files   (+1)
  - the PAT division is now comprised of 8 files   (+1)
  - the PRI division is now comprised of 25 files  (+1)
  - the ROD division is now comprised of 7 files   (+1)

1.3.2 New /mol_type qualifier

  As of the April 2003 GenBank Release (135.0), a new source feature
  qualifier called /mol_type will begin to be used for source features.

  This qualifier will be used to indicate the in-vivo biological state
  of the sequence presented in a database record.

  The preliminary definition for /mol_type is :

        Qualifier       /mol_type=
        Definition      in vivo molecule type
        Value format    "text"
        Example         /mol_type="genomic DNA"
        Comment         text limited to "genomic DNA", "genomic RNA",
                        "mRNA" (incl. EST), "tRNA", "rRNA", "snoRNA", "snRNA",
                        "scRNA", "pre-mRNA", "other RNA" (incl. synthetic),
                        "other DNA" (incl. synthetic),
                        "unassigned DNA" (incl. unknown),
                        "unassigned RNA" (incl. unknown)

  In-vivo molecule type information is already presented on the LOCUS
  line of the GenBank flatfile format. However, introducing /mol_type
  in the Feature Table will make the exchange of this information among
  DDBJ, EMBL, and GenBank more complete and accurate.

  NOTE: /mol_type will eventually be a mandatory qualifier for the source
  feature, probably by June 2003.
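
  For those who process flatfiles programmatically, a check against the
  limited value list might look like the sketch below. The helper name
  is_valid_mol_type is hypothetical, and the exact, case-sensitive
  comparison is an assumption rather than a documented rule.

```python
# Values copied from the Comment field of the preliminary definition.
ALLOWED_MOL_TYPES = frozenset({
    "genomic DNA", "genomic RNA", "mRNA", "tRNA", "rRNA", "snoRNA",
    "snRNA", "scRNA", "pre-mRNA", "other RNA", "other DNA",
    "unassigned DNA", "unassigned RNA",
})


def is_valid_mol_type(value: str) -> bool:
    """Return True if value is one of the listed /mol_type strings.

    Hypothetical helper: the match is exact and case-sensitive, which
    is an assumption made for this sketch.
    """
    return value in ALLOWED_MOL_TYPES


print(is_valid_mol_type("genomic DNA"))
print(is_valid_mol_type("DNA"))
```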

1.3.3 New /segment qualifier

  As of the April 2003 GenBank Release (135.0), a new source feature
  qualifier called /segment will begin to be used for source features.

  In the absence of a more suitable way to annotate viral segments, this 
  information had either not been included in database entries, or had been 
  annotated incorrectly (e.g. using /chromosome, /map etc). This new
  qualifier addresses that lack.

  The preliminary definition for /segment is :

        Qualifier       /segment=    
        Definition      name of viral or phage segment sequenced
        Value format    "text"
        Example         /segment="6"

1.3.4 New /locus_tag qualifier

  As of the April 2003 GenBank Release (135.0), a new feature qualifier
  called /locus_tag will begin to be used.

  Many complete-genome sequencing projects use solely computational
  methods to predict coding regions and genes. The /locus_tag qualifier
  provides a method for identifying and tracking the results of such
  computations, without utilizing existing qualifiers such as /gene .

  These 'locus tags' are systematically assigned, and do not necessarily
  reflect gene name/symbol conventions in experimental literature. Hence
  the introduction of this new qualifier.

  The preliminary definition for /locus_tag is :

        Qualifier:      /locus_tag
        Definition:     feature tag assigned for tracking purposes 
        Value Format:   "text" (single token)
        Example:        /locus_tag="RSc0382"
                        /locus_tag="YPO0002"
        Comment:        /locus_tag can be used with any feature where /gene 
                        is valid
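
  As an illustration only, a qualifier line of this shape could be read
  as sketched below. The function name parse_locus_tag and the regular
  expression are hypothetical, and "single token" is interpreted here as
  a non-empty value containing no whitespace, which is an assumption.

```python
import re

# Hypothetical pattern for one qualifier line, e.g. /locus_tag="RSc0382"
LOCUS_TAG_RE = re.compile(r'^\s*/locus_tag="([^"]*)"\s*$')


def parse_locus_tag(line: str):
    """Return the tag if line holds a single-token /locus_tag, else None."""
    match = LOCUS_TAG_RE.match(line)
    if match is None:
        return None
    value = match.group(1)
    # Assumption: "single token" means non-empty and free of whitespace.
    if not value or any(ch.isspace() for ch in value):
        return None
    return value


print(parse_locus_tag('        /locus_tag="RSc0382"'))
print(parse_locus_tag('        /locus_tag="two words"'))
```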

1.3.5 GSS File Header Problem

  GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.

  There is thus a discrepancy between the filenames and file headers of nine
GSS flatfiles in Release 135.0. Consider the gbgss60.seq file:

GBGSS1.SEQ           Genetic Sequence Data Bank
                           April 15 2003

                NCBI-GenBank Flat File Release 135.0

                           GSS Sequences (Part 1)

   86691 loci,    65546719 bases, from    86691 reported sequences

  Here, the file number and part number in the header are both "1", though
the file has been renamed as "60" based on the files dumped from the other
system.

  We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
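
  Users whose software relies on the header can detect the affected files
by comparing the name in the first header line with the actual filename. The
sketch below assumes the header layout shown in the gbgss60.seq example; the
function names are hypothetical.

```python
import re


def file_number(name: str):
    """Return the trailing number of a name such as 'gbgss60.seq'."""
    match = re.search(r"(\d+)\.seq$", name, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def header_matches_filename(filename: str, header_line: str) -> bool:
    """Compare a filename with the name at the start of the header line.

    Assumption: the header name is the first whitespace-delimited token,
    as in 'GBGSS1.SEQ           Genetic Sequence Data Bank'.
    """
    tokens = header_line.split()
    return bool(tokens) and tokens[0].lower() == filename.lower()


header = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_matches_filename("gbgss60.seq", header))
print(file_number("gbgss60.seq"), file_number(header.split()[0]))
```

  For the gbgss60.seq example the comparison reports a mismatch, with file
number 60 in the filename against 1 in the header.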


1.4 Upcoming Changes

  No changes to the format of GenBank releases are currently scheduled.



---


- gttaacaattaaagagtgtttatcgaaattcattatatagtggtttatatagaccacttc
-
- GenBank newsgroup see: http://www.bio.net/hypermail/genbankb/       
- GENBANKB e-mail: messages sent to genbankb at net.bio.net
- subscribe: e-mail biosci-server at net.bio.net with: subscribe genbankb
- unsub: e-mail biosci-server at net.bio.net with: unsubscribe genbankb      
- GenBank on the WWW, see:  http://www.ncbi.nlm.nih.gov/Genbank/
- problems with GENBANKB? E-mail moderator: francis at cmmt.ubc.ca                  




