GenBank Release 134.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Fri Feb 14 14:01:52 EST 2003


Greetings GenBank Users,

  GenBank Release 134.0 is now available via ftp from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 134.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 134.0

  Uncompressed, the Release 134.0 flatfiles require approximately 96.94 GB
(sequence files only) or 109.9 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires approximately 86.36 GB. From the
release notes:

   Release  Date       Base Pairs   Entries

   133      Dec 2002   28507990166  22318883
   134      Feb 2003   29358082791  23035823

  Close-of-data was 02/10/2003. Four working days were required to prepare
this release. In the six-week period between the close dates for GenBank
releases 133.0 and 134.0, GenBank grew by 850,092,625 basepairs and by
716,940 sequence records. During that same period, 64,040 records were
updated. Combined, this yields an average of about 18,600 new/updated
records per day.
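
  The arithmetic behind these figures can be reproduced with a short
Python sketch. The totals are copied from the table above; treating the
"six week period" as exactly 42 days is an assumption, and the function
and variable names are illustrative only.

```python
# Totals copied from the release-notes table above.
PREV = {"release": 133, "base_pairs": 28_507_990_166, "entries": 22_318_883}
CURR = {"release": 134, "base_pairs": 29_358_082_791, "entries": 23_035_823}

UPDATED_RECORDS = 64_040  # records updated during the same period
PERIOD_DAYS = 6 * 7       # assumption: "six weeks" means exactly 42 days


def growth(prev, curr):
    """Return (new base pairs, new entries) between two releases."""
    return (curr["base_pairs"] - prev["base_pairs"],
            curr["entries"] - prev["entries"])


def daily_rate(new_entries, updated, days):
    """Average number of new or updated records per day."""
    return (new_entries + updated) / days


new_bp, new_entries = growth(PREV, CURR)
rate = daily_rate(new_entries, UPDATED_RECORDS, PERIOD_DAYS)
print(new_bp, new_entries, round(rate))
```

  Under that 42-day assumption the result rounds to roughly 18,600
records per day, in line with the figure quoted above.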

  We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank .
Those who experience slow FTP transfers of large files (entire releases, the
GenBank Cumulative Update, etc.) might realize an improvement in transfer
rates from these alternate sites when traffic at the NCBI is heavy.

  For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 134.0 and Upcoming Changes) have been appended below.

                         * * * IMPORTANT * * *

  As described in the October and December 2002 release notes, the
GenBank Cumulative Update (GBCU) data products are no longer supported as
of this February 2003 release. Details about this change are available in
Section 1.3.2 of the release notes. Note that NCBI will continue to generate
the GBCU products for an additional three weeks, on an unsupported basis,
as an aid to those who need additional time to transition to our incremental
update products.

  Release 134.0 data, and subsequent updates, are available now via NCBI's
Entrez and BLAST services.

  New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 134.0 close-of-data, should be
available by about 10:00am EST, February 15. Please note that the new CUs will
be smaller than previous versions you might have obtained after Release 133.0
was posted.

  If you encounter problems while ftp'ing or uncompressing Release 134.0,
please send email outlining your difficulties to info at ncbi.nlm.nih.gov .

Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev, Michael Kimelman
GenBank
NCBI/NLM/NIH


1.3 Important Changes in Release 134.0

1.3.1 Organizational changes

  The total number of sequence data files increased by 10 with this release:

  - the EST division is now comprised of 240 files (+5)
  - the GSS division is now comprised of 66 files  (+3)
  - the HTC division is now comprised of 4 files   (+1)
  - the HTG division is now comprised of 58 files  (+1)

1.3.2 * * Cumulative GenBank Update Products Discontinued * *

  As of GenBank Release 134.0, the cumulative GenBank Update (GBCU)
products have been discontinued:

	ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily/gbcu.aso.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.flat.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.fsa_nt.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.gnp.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.qscore.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gpcu.fsa.gz

  In the eight weeks between typical GenBank Releases, it was not uncommon
for GBCU products to approach 20% of the total database size. The flatfile
version, for example, reached sizes in excess of 17 GB in late 2002.

  From a user perspective, repeatedly obtaining and processing such a
large update product makes inefficient use of both bandwidth and local
resources, compared to the much smaller incremental GbUpdate products.

  And in order to reliably generate the GBCU in the face of such explosive
growth, NCBI would have to invest significant resources to increase the 
performance of a large body of software.

  Given these factors, plus the questionable value of an "update" product,
generated daily, and approaching 20GB in size, NCBI has discontinued
support for the GBCU products.

  However, as an aid to those users who may not yet have completed 
transitioning to the use of incremental update products, GBCU files
will continue to be generated, on an _unsupported_ basis, for approximately
three more weeks. After that time, the GBCU files will be removed from
the NCBI FTP site.

1.3.3 Third-Party Annotation Data Collection

  Pursuant to agreements made at their 2002 Collaborative Meeting,
DDBJ/EMBL/GenBank have undertaken the collection of a new class of
sequence data: Third-Party Annotation (TPA).

  The TPA data-collection complements the existing DDBJ/EMBL/GenBank
comprehensive database of primary nucleotide sequences, which typically
result from direct sequencing of cDNAs, ESTs, genomic DNAs, etc.

  'Primary data' are defined to be data for which the submitting group has
done the sequencing and annotation, and hence, as owner of the data,
has privileges to update/correct the associated sequence records. In
contrast, non-primary (TPA) sequences are defined as sequences which:

  a) consist exclusively of sequence data from one, or several,
     previously-existing primary entries owned by other groups, or

  b) consist of a mixture of previously-existing primary entries,
     some owned by the TPA submitter and the rest by one or more other
     groups

  Complete details regarding TPA sequence submission can be found
at the NCBI website:

     http://www.ncbi.nlm.nih.gov/Genbank/tpa.html
  
TPA categories and requirements  
-------------------------------

  Users can submit new annotation of single sequences or assemblies
of sequences that are owned by other groups to the TPA data
collection.

  The primary sequences must be available in the DDBJ/EMBL/GenBank
databases, and submitters to the TPA database must provide the
accession numbers of the primary sequences in their TPA submission.

  TPA sequences based on primary data available only in proprietary
databases are not accepted.

  Some examples of data submissions accepted for TPA include:

     1. analysis and re-annotation of DDBJ/EMBL/GenBank sequences
        owned by other groups
     2. gap-filling, in which a TPA submitter might utilize HTG or
        EST data to complete an otherwise incomplete sequence
     3. TPA sequences based on NCBI/Ensembl trace archive data
     4. TPA sequences based on Whole Genome Shotgun (WGS) sequences

  Sequences based on primary data from multiple organisms are not
accepted.

  Sequences will not be accepted for TPA in lieu of an update to
primary records. A submitter who owns a primary record is expected
to update that record as new sequence is determined, or sequencing
ambiguities/errors are resolved.

  Any newly-determined sequence data that is to be part of a TPA
record must first be submitted as a new primary sequence to
DDBJ/EMBL/GenBank.
  
  The TPA dataset is intended to present sequence data and annotation
in support of actual biological discoveries that are published in
the scientific literature, without requiring that the sequence be
determined by the authors/submitters.
  
  In order to assure that the sequence annotation is of high quality, 
it is required that TPA records be associated with a study published
in a peer-reviewed journal before the data is released to the public.

  TPA records include a mandatory 'PRIMARY' block, which documents the
relationships between spans of the TPA sequence and the primary
(non-TPA) sequences that contributed to it. The elements of the
PRIMARY block are:
     
  a) TPA_SPAN             base span on TPA sequence
  b) PRIMARY_IDENTIFIER   acc.version of contributing sequence(s)
  c) PRIMARY_SPAN         base span on contributing primary sequence
  d) COMP                 'c' is used to indicate that the contributing
                          sequence originates from the complementary
                          strand in the primary sequence entry

  Example:

  TPA_SPAN       PRIMARY_IDENTIFIER     PRIMARY_SPAN     COMP
  1-426          AC004528.1             18665-19090         
  427-526        AC001234.2             1-100            c
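
  As an illustration, rows of a PRIMARY block such as the example above
could be read with a short Python sketch. It assumes whitespace-separated
columns and an optional trailing 'c' in the COMP column; the names
PrimarySpan and parse_primary_rows are hypothetical and are not part of
any NCBI tool. The length comparison is only a plausibility check added
for this sketch, not a rule stated in the release notes.

```python
from typing import NamedTuple


class PrimarySpan(NamedTuple):
    tpa_start: int
    tpa_end: int
    primary_id: str     # accession.version of the contributing sequence
    primary_start: int
    primary_end: int
    complement: bool    # True when the COMP column holds 'c'


def parse_span(text):
    """Turn a span such as '18665-19090' into a pair of integers."""
    start, end = text.split("-")
    return int(start), int(end)


def parse_primary_rows(lines):
    """Parse PRIMARY block rows, skipping blank lines and the header."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] == "TPA_SPAN":
            continue
        tpa_start, tpa_end = parse_span(fields[0])
        prim_start, prim_end = parse_span(fields[2])
        comp = len(fields) > 3 and fields[3] == "c"
        rows.append(PrimarySpan(tpa_start, tpa_end, fields[1],
                                prim_start, prim_end, comp))
    return rows


def same_length(row):
    """Plausibility check: do both spans cover the same number of bases?"""
    return (row.tpa_end - row.tpa_start) == (row.primary_end - row.primary_start)


EXAMPLE = """\
TPA_SPAN       PRIMARY_IDENTIFIER     PRIMARY_SPAN     COMP
1-426          AC004528.1             18665-19090
427-526        AC001234.2             1-100            c
"""

rows = parse_primary_rows(EXAMPLE.splitlines())
for row in rows:
    print(row, same_length(row))
```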


TPA data products
-----------------

  TPA update products became available at the NCBI FTP site on Friday,
January 31, 2003. Daily, incremental update files for all new/updated
TPA records are located in:

        ftp://ftp.ncbi.nih.gov/tpa/updates

TPA updates have filename prefixes of:

        tpa_upd.YYYY.MMDD.

Filename suffixes for these updates are:

	.bbs     : binary Bioseq-set (ASN.1)
	.gbff    : GenBank flatfile
	.gnp     : GenPept flatfile
	.fsa_nt  : Nucleotide FASTA
	.fsa_aa  : Protein FASTA
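
  A minimal Python sketch of how such a filename might be assembled for
a given date follows. It assumes that the trailing dot of the prefix and
the leading dot of the suffix collapse into a single dot (for example,
tpa_upd.2003.0131.gbff), and it ignores any compression extension the
files on the FTP site may carry. The function name is hypothetical.

```python
from datetime import date

# Suffixes as listed above, without the leading dot.
TPA_SUFFIXES = {
    "bbs": "binary Bioseq-set (ASN.1)",
    "gbff": "GenBank flatfile",
    "gnp": "GenPept flatfile",
    "fsa_nt": "Nucleotide FASTA",
    "fsa_aa": "Protein FASTA",
}


def tpa_update_filename(day, suffix):
    """Build a name of the form tpa_upd.YYYY.MMDD.<suffix> (assumed layout)."""
    suffix = suffix.lstrip(".")
    if suffix not in TPA_SUFFIXES:
        raise ValueError(f"unknown TPA suffix: {suffix!r}")
    return f"tpa_upd.{day:%Y.%m%d}.{suffix}"


print(tpa_update_filename(date(2003, 1, 31), "gbff"))
```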

  We do not expect to generate complete releases (similar to GenBank
releases) for TPA until the volume of TPA records has substantially
increased. Until that time, a set of cumulative TPA update files
containing all TPA records is available in:

        ftp://ftp.ncbi.nih.gov/tpa/release

Cumulative TPA update files have filename prefixes of:

        tpa_cu.

and utilize the same filename suffixes that are listed above. Note
that the cumulative TPA products will be *discontinued* once TPA
releases are being built.


1.3.4 GSS File Header Problem

  GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.

  There is thus a discrepancy between the filenames and file headers of nine
GSS flatfiles in Release 134.0. Consider the gbgss56.seq file:

GBGSS1.SEQ           Genetic Sequence Data Bank
                          February 15 2003

                NCBI-GenBank Flat File Release 134.0

                           GSS Sequences (Part 1)

   88066 loci,    66600405 bases, from    88066 reported sequences

  Here, the header gives the filename as GBGSS1.SEQ and the part number as 1,
though the file has been renamed as "56" based on the files dumped from the
other system.

  We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
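
  Users who want to detect affected files locally could compare each
filename with the first line of the file. The Python sketch below
assumes filenames of the form gb<division><number>.seq and a header
line that begins with the upper-case equivalent, as in the gbgss56.seq
example above; the regular expressions and function names are
illustrative only.

```python
import re

# Assumed patterns, based only on the gbgss56.seq example shown above.
FILENAME_RE = re.compile(r"^gb([a-z]+)(\d+)\.seq$")
HEADER_RE = re.compile(r"^GB([A-Z]+)(\d+)\.SEQ\b")


def filename_id(filename):
    """Return (division, part number) taken from the filename."""
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return m.group(1), int(m.group(2))


def header_id(first_line):
    """Return (division, part number) taken from the file's header line."""
    m = HEADER_RE.match(first_line.strip())
    if m is None:
        raise ValueError(f"unexpected header line: {first_line!r}")
    return m.group(1).lower(), int(m.group(2))


def header_matches_filename(filename, first_line):
    """True when the header names the same division and part as the file."""
    return filename_id(filename) == header_id(first_line)


header = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_matches_filename("gbgss56.seq", header))
```

  For the gbgss56.seq example the check reports a mismatch, since the
header still carries part number 1.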


1.4 Upcoming Changes

1.4.1 New /mol_type qualifier

  As of the April 2003 GenBank Release (135.0), a new source feature
  qualifier called /mol_type will begin to be used for source features.

  This qualifier will be used to indicate the in-vivo biological state
  of the sequence presented in a database record.

  The preliminary definition for /mol_type is :

        Qualifier       /mol_type=
        Definition      in vivo molecule type  
        Value format    "text"
        Example         /mol_type="genomic DNA"

        Comment         text limited to "genomic DNA", "genomic RNA",
                        "mRNA" (incl. EST), "tRNA", "rRNA", "snoRNA",
                        "snRNA", "scRNA", "pre-mRNA",
                        "other RNA" (incl. synthetic),
                        "other DNA" (incl. synthetic),
                        "unassigned DNA" (incl. unknown),
                        "unassigned RNA" (incl. unknown)

  In-vivo molecule type information is already presented on the LOCUS
  line of the GenBank flatfile format. However, introducing /mol_type
  in the Feature Table will make the exchange of this information among
  DDBJ, EMBL, and GenBank more complete and accurate.

  NOTE: /mol_type will eventually be a mandatory qualifier for the source
  feature, probably by June 2003.
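
  Because the value is restricted to a fixed list, software that reads
flatfiles could check /mol_type values against that list. The Python
sketch below uses the thirteen values from the preliminary definition
above, which may change before the qualifier becomes mandatory; the
regular expression and function names are illustrative assumptions.

```python
import re

# Values from the preliminary /mol_type definition above (subject to change).
ALLOWED_MOL_TYPES = frozenset({
    "genomic DNA", "genomic RNA", "mRNA", "tRNA", "rRNA",
    "snoRNA", "snRNA", "scRNA", "pre-mRNA",
    "other RNA", "other DNA", "unassigned DNA", "unassigned RNA",
})

MOL_TYPE_RE = re.compile(r'/mol_type="([^"]*)"')


def extract_mol_type(line):
    """Return the quoted /mol_type value on a line, or None if absent."""
    m = MOL_TYPE_RE.search(line)
    return m.group(1) if m else None


def is_valid_mol_type(value):
    """True when the value is one of the preliminary allowed strings."""
    return value in ALLOWED_MOL_TYPES


value = extract_mol_type('                     /mol_type="genomic DNA"')
print(value, is_valid_mol_type(value))
```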

1.4.2 New /segment qualifier

  As of the April 2003 GenBank Release (135.0), a new source feature
  qualifier called /segment will begin to be used for source features.

  In the absence of a more suitable way to annotate viral segments, this 
  information had either not been included in database entries, or had been 
  annotated incorrectly (e.g. using /chromosome, /map etc). This new
  qualifier addresses that lack.

  The preliminary definition for /segment is :

        Qualifier       /segment=    
        Definition      name of viral or phage segment sequenced
        Value format    "text"
        Example         /segment="6"

1.4.3 New /locus_tag qualifier

  As of the April 2003 GenBank Release (135.0), a new feature
  qualifier called /locus_tag will begin to be used.

  Many complete-genome sequencing projects use solely computational
  methods to predict coding regions and genes. The /locus_tag qualifier
  provides a method for identifying and tracking the results of such
  computations, without utilizing existing qualifiers such as /gene .

  These 'locus tags' are systematically assigned, and do not necessarily
  reflect gene name/symbol conventions in experimental literature. Hence
  the introduction of this new qualifier.

  The preliminary definition for /locus_tag is :

        Qualifier:      /locus_tag
        Definition:     feature tag assigned for tracking purposes 
        Value Format:   "text" (single token)
        Example:        /locus_tag="RSc0382"
                        /locus_tag="YPO0002"
        Comment:        /locus_tag can be used with any feature where /gene 
                        is valid;
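
  As a rough illustration, /locus_tag values could be collected from
feature table lines and checked against the "single token" requirement
with a few lines of Python. The regular expression and the reading of
"single token" as "non-empty, no whitespace" are assumptions made for
this sketch.

```python
import re

LOCUS_TAG_RE = re.compile(r'/locus_tag="([^"]*)"')


def locus_tags(feature_lines):
    """Collect every quoted /locus_tag value found in the given lines."""
    tags = []
    for line in feature_lines:
        m = LOCUS_TAG_RE.search(line)
        if m:
            tags.append(m.group(1))
    return tags


def is_single_token(value):
    """Assumed meaning of 'single token': non-empty and free of whitespace."""
    return bool(value) and len(value.split()) == 1 and value == value.strip()


lines = [
    '                     /locus_tag="RSc0382"',
    '                     /locus_tag="YPO0002"',
    '                     /gene="dnaA"',
]
tags = locus_tags(lines)
print(tags, all(is_single_token(t) for t in tags))
```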
