GenBank Release 133.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Sun Jan 5 20:36:51 EST 2003


Greetings GenBank Users,

  GenBank Release 133.0 is now available via ftp from the National Center
for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 133.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 133.0

  Uncompressed, the Release 133.0 flatfiles require roughly 94.33 GB
(sequence files only) or 107.07 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires roughly 84.21 GB. From the
release notes:

   Release  Date       Base Pairs   Entries

   132      Oct 2002   26525934656  19808101
   133      Dec 2002   28507990166  22318883

  Close-of-data was 12/31/2002. Four working days were required to prepare
this release. In the eight week period between close-of-data for GenBank
releases 132.0 and 133.0, GenBank grew by 1.982 billion basepairs and by
2,510,782 sequence records. During that same period, 133,297 records were
updated. Combined, this yields an average of about 44,000 new/updated
records per day.
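
  The growth figures quoted above can be rechecked from the two table rows.
The short Python sketch below is only an illustration; the variable names are
ours and are not part of any NCBI tool.

```python
# Totals copied from the release table above.
releases = {
    "132": {"base_pairs": 26525934656, "entries": 19808101},
    "133": {"base_pairs": 28507990166, "entries": 22318883},
}
updated_records = 133297  # records updated during the same period

# Growth between the two releases.
new_base_pairs = releases["133"]["base_pairs"] - releases["132"]["base_pairs"]
new_entries = releases["133"]["entries"] - releases["132"]["entries"]

# New plus updated records, the quantity behind the per-day average.
new_or_updated = new_entries + updated_records

print(new_base_pairs)   # 1982055510
print(new_entries)      # 2510782
print(new_or_updated)   # 2644079
```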

  The growth in the number of records is the largest experienced in
GenBank's history. The growth in the number of basepairs is the third
largest.

  We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank .
Those who experience slow FTP transfers of large files (entire releases, the
GenBank Cumulative Update, etc.) might realize an improvement in transfer
rates from these alternate sites when traffic at the NCBI is heavy.

  For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 133.0 and Upcoming Changes) have been appended below.

                         * * * IMPORTANT * * *

  As described in the October release notes, the GenBank Cumulative Update
data products will be discontinued in February of 2003. We strongly urge users
of the GBCU to review Section 1.4.1 of the release notes for further 
information.

  Release 133.0 data, and subsequent updates, are available now via NCBI's
Entrez and Blast services.

  New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 133.0 close-of-data, should be
available by about 10:00am EST, January 6. Please note that the new CUs will be
smaller than previous versions you might have obtained after Release 132.0 was
posted.

  If you encounter problems while ftp'ing or uncompressing Release 133.0,
please send email outlining your difficulties to info at ncbi.nlm.nih.gov .

Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev, Michael Kimelman
GenBank
NCBI/NLM/NIH


1.3 Important Changes in Release 133.0

1.3.1 Organizational changes

  The total number of sequence data files increased by 55 with this release:

  - the EST division is now comprised of 235 files
  - the GSS division is now comprised of 63 files
  - the HTC division is now comprised of 3 files
  - the HTG division is now comprised of 57 files
  - the PAT division is now comprised of 7 files
  - the PLN division is now comprised of 7 files
  - the PRI division is now comprised of 24 files
  - the ROD division is now comprised of 6 files

  However, note that there was a special supplemental file (gbsup.seq) for
  full-length-insert cDNA sequences in Release 132.0.

  The fli-cDNA sequences *should* have been present in the standard
  divisional files. The problem that required the use of the supplemental
  file has been corrected, so this release does not include gbsup.seq .

  If the removal of gbsup.seq is taken into account, the total number of
  sequence data files increased by 54 .
  
1.3.2 New SET Data File For The ASN.1 Representation

  Some phylogenetic and mutational studies involve sequences from more
than one of the 'taxonomic' divisions of GenBank. For example, a phylogenetic
study might involve sequences obtained from human (PRI) and non-primate
mammalian (MAM) sources.

  Such studies, often with associated sequence alignments, are maintained
and edited as a single unit in the underlying data representation (ASN.1)
utilized by the NCBI.

  When generating GenBank flatfiles from such studies, the component
sequences are processed in such a way that they are individually directed
to an appropriate divisional file. For example, the human sequences to a
PRI division file, the other mammalian sequences to the MAM division file.

  In the past, we have mimicked this behavior for the ASN.1 version of
GenBank releases by splitting the studies into their components, and 
splicing them into an appropriate divisional ASN.1 file (eg, gbpri1.aso
and gbmam.aso) .

  This practice has clear disadvantages: the components of the studies
really should *not* be separated; and the post-processing of these special
studies adds considerable overhead to release processing.

  Starting with this December 2002 release, we have ceased this practice
and have introduced a new ASN.1 data file for such multi-divisional
studies at ftp://ftp.ncbi.nih.gov/ncbi-asn1 . The new file is :

	gbset.aso

1.3.3 Reduction In The Number Of ASN.1 Data Files

  The sizes of many of the files for the ASN.1 version of GenBank releases
(see ftp://ftp.ncbi.nih.gov/ncbi-asn1 ) used to be well below the 250 MB
utilized for the GenBank flatfile version. For example, the PRI division
ASN.1 files were about 140 MB apiece, and the HTG division files were about
190 MB apiece, for GenBank Release 132.0 .

  This was due to the fact that the ASN.1 representation was originally
used to create the flatfile version, on a file-by-file basis, during release
generation. Since the ASN.1 version is more compact than the flatfile version,
the ASN.1 file sizes had to be less than 250 MB to yield 250 MB flatfiles.

  Now that the ASN.1 and flatfile versions are created independently, the
sizes of the ASN.1 files can be increased without consequences for the
flatfiles.

  Starting with this December 2002 release, the file size limit for all
ASN.1 files has been increased to 250MB, and as a result, the total number of
ASN.1 files has been significantly reduced.

1.3.4 GSS File Header Problem

  GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.

  There is thus a discrepancy between the filenames and file headers of nine
GSS flatfiles in Release 133.0. Consider the gbgss55.seq file:

GBGSS1.SEQ           Genetic Sequence Data Bank
                          December 15 2002

                NCBI-GenBank Flat File Release 133.0

                           GSS Sequences (Part 1)

   88062 loci,    66597557 bases, from    88062 reported sequences

  Here, the header gives the filename as GBGSS1.SEQ and the part number as 1,
though the file has been renamed gbgss55.seq based on the files dumped from
the other system.

  We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
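
  Users who script their processing may want to detect the affected files
themselves. The Python sketch below compares the part number in a GSS file
name with the one on the first header line. The function name and the regular
expressions are our own assumptions, based only on the gbgss55.seq example
shown above.

```python
import re

def header_part_mismatch(filename, header_line):
    """Return (file_part, header_part) if the numbers differ, else None."""
    file_match = re.fullmatch(r"gbgss(\d+)\.seq", filename.lower())
    header_match = re.match(r"\s*GBGSS(\d+)\.SEQ\b", header_line.upper())
    if not file_match or not header_match:
        raise ValueError("not a recognizable GSS file name or header line")
    file_part = int(file_match.group(1))
    header_part = int(header_match.group(1))
    return None if file_part == header_part else (file_part, header_part)

# First header line of gbgss55.seq as shown above.
header = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_part_mismatch("gbgss55.seq", header))  # (55, 1)
```

  A result of None means that the file name and header agree.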

1.4 Upcoming Changes

1.4.1 * * Cumulative GenBank Update Products To Be Discontinued * *

  As of GenBank Release 134.0 in February of 2003, the cumulative
GenBank Update (GBCU) products will be discontinued:

	ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily/gbcu.aso.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.flat.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.fsa_nt.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.gnp.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.qscore.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gpcu.fsa.gz

  In the eight weeks between typical GenBank Releases, it is not uncommon
for GBCU products to approach 20% of the total database size. The flatfile
version, for example, has reached sizes in excess of 17 GB in recent weeks.

  From a user perspective, repeatedly obtaining and processing such a
large update product makes inefficient use of both bandwidth and local
resources, compared to the much smaller incremental GbUpdate products.

  And in order to reliably generate the GBCU in the face of such explosive
growth, NCBI would have to invest significant resources to increase the 
performance of a large body of software.

  Given these factors, plus the questionable value of an "update" product,
generated daily, which will soon approach 20GB in size, we have decided
that the GBCU should be discontinued. We will analyze FTP logs and 
proactively contact the larger centers which utilize the GBCU, to suggest
alternate processing strategies.

  If large numbers of users are unable to switch to processing incremental
updates by February 2003, there is a possibility that the date for
discontinuing the GBCU might be pushed back to April.

  We will keep users informed of the timetable for this important change
via these release notes and the GenBank newsgroup. And of course, we 
welcome discussions of this change via the newsgroup.

1.4.2 New /mol_type qualifier

  As of the April 2003 GenBank Release (135.0), a new source feature
  qualifier called /mol_type will begin to be used for source features.

  This qualifier will be used to indicate the in-vivo biological state
  of the sequence presented in a database record.

  The preliminary definition for /mol_type is :

        Qualifier       /mol_type=
        Definition      in vivo molecule type
        Value format    "text"
        Example         /mol_type="genomic DNA"
        Comment         text limited to "genomic DNA", "genomic RNA",
                        "mRNA" (incl. EST), "tRNA", "rRNA", "snoRNA",
                        "snRNA", "scRNA", "pre-mRNA",
                        "other RNA" (incl. synthetic),
                        "other DNA" (incl. synthetic),
                        "unassigned DNA" (incl. unknown),
                        "unassigned RNA" (incl. unknown)

  In-vivo molecule type information is already presented on the LOCUS
  line of the GenBank flatfile format. However, introducing /mol_type
  in the Feature Table will make the exchange of this information among
  DDBJ, EMBL, and GenBank more complete and accurate.

  NOTE: /mol_type will eventually be a mandatory qualifier for the source
  feature, probably by June 2003.
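
  Because the value is limited to a fixed list, software that reads
  flatfiles can check it directly. The Python sketch below is one possible
  check against the preliminary list given above; the function name is
  hypothetical, and the list may change before the qualifier is finalized.

```python
# Allowed values copied from the preliminary definition above.
MOL_TYPE_VALUES = frozenset({
    "genomic DNA", "genomic RNA", "mRNA", "tRNA", "rRNA", "snoRNA",
    "snRNA", "scRNA", "pre-mRNA", "other RNA", "other DNA",
    "unassigned DNA", "unassigned RNA",
})

def parse_mol_type(qualifier_line):
    """Extract and validate the value from a /mol_type="..." line."""
    text = qualifier_line.strip()
    prefix = '/mol_type="'
    well_formed = (
        text.startswith(prefix)
        and text.endswith('"')
        and len(text) > len(prefix)
    )
    if not well_formed:
        raise ValueError("not a /mol_type qualifier: " + qualifier_line)
    value = text[len(prefix):-1]
    if value not in MOL_TYPE_VALUES:
        raise ValueError("value not in preliminary list: " + value)
    return value

print(parse_mol_type('/mol_type="genomic DNA"'))  # genomic DNA
```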

1.4.3 New /segment qualifier

  As of the April 2003 GenBank Release (135.0), a new source feature
  qualifier called /segment will begin to be used for source features.

  In the absence of a more suitable way to annotate viral segments, this 
  information had either not been included in database entries, or had been 
  annotated incorrectly (e.g. using /chromosome, /map etc). This new
  qualifier addresses that lack.

  The preliminary definition for /segment is :

        Qualifier       /segment=    
        Definition      name of viral or phage segment sequenced
        Value format    "text"
        Example         /segment="6"

1.4.4 New /locus_tag qualifier

  As of the April 2003 GenBank Release (135.0), a new feature
  qualifier called /locus_tag will begin to be used.

  Many complete-genome sequencing projects use solely computational
  methods to predict coding regions and genes. The /locus_tag qualifier
  provides a method for identifying and tracking the results of such
  computations, without utilizing existing qualifiers such as /gene .

  These 'locus tags' are systematically assigned, and do not necessarily
  reflect gene name/symbol conventions in experimental literature. Hence
  the introduction of a new qualifier.

  The preliminary definition for /locus_tag is :

        Qualifier:      /locus_tag
        Definition:     feature tag assigned for tracking purposes 
        Value Format:   "text" (single token)
        Example:        /locus_tag="RSc0382"
                        /locus_tag="YPO0002"
        Comment:        /locus_tag can be used with any feature where /gene
                        is valid

1.4.5 Third-Party Annotation and Consensus Sequences (TPA)

  Pursuant to agreements made at the 2002 Collaborative Meeting,
  DDBJ/EMBL/GenBank have undertaken the collection of a new class of
  sequence data : Third-Party Annotation and Consensus Sequences (TPA).

  The TPA data-collection will complement the existing DDBJ/EMBL/GenBank
  comprehensive database of primary nucleotide sequences, which typically result
  from direct sequencing of cDNAs, ESTs, genomic DNAs, etc.

  'Primary data' are defined to be data for which the submitting group has done
  the sequencing and annotation, and as 'owner' of these data has privileges to
  update/correct the associated sequence records.

  In contrast, non-primary (TPA) sequences are defined as sequences which:

  a) consist exclusively of sequence data from one, or several,
     previously-existing entries 'owned' by other groups, or

  b) consist of a mixture of new & previously-existing sequences

  TPA categories and requirements  
  -------------------------------

  Users can submit re-annotations/re-assemblies of sequences already 
  present in DDBJ/EMBL/GenBank and owned by other groups to be 
  included in the Third Party Annotation (TPA) data-collection. 

  Categories of data submissions accepted for TPA include:

     1. re-annotation/analysis of sequence(s) from DDBJ/EMBL/GenBank
     2. mixtures of primary/non-primary sequences, including regions of 
        new and existing sequence (e.g. filling gaps in a sequence
	with data from HTG or EST projects, or newly sequenced data)
     3. TPA sequences based on NCBI/Ensembl trace archive data
     4. TPA sequences based on Whole Genome Shotgun (WGS) sequences

  Consensus sequences from multiple organisms are not accepted.
 
  The TPA dataset is primarily intended as a means to present sequence
  and annotation in support of actual biological discoveries, published
  in the scientific literature, without requiring that every basepair
  has actually been sequenced by the authors/submitters.
  
  In order to assure that the sequence annotation is of high quality, 
  it is required that TPA records be associated with a study published
  in a peer-reviewed journal before the data is released to the public.

  Third Party Annotation (TPA) records include a mandatory 'TPA-block'
  which documents the relationships between spans of the TPA sequence
  and the primary (non-TPA) sequences that contributed to it. The
  elements of the TPA-block are:
     
  a) TPA_SPAN             base span on TPA sequence
  b) PRIMARY_IDENTIFIER   acc.version of contributing sequence(s)
  c) PRIMARY_SPAN         base span on contributing primary sequence
  d) COMP                 'c' is used to indicate that the contributing
                          sequence originates from the complementary
                          strand in the primary sequence entry

  Example:

  TPA_SPAN       PRIMARY_IDENTIFIER     PRIMARY_SPAN     COMP
  1-426          AC004528.1             18665-19090         
  427-526        AC001234.2             1-100            c
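
  The example rows above can be read with a few lines of code. The Python
  sketch below assumes whitespace-separated columns exactly as in the
  example, with the COMP column optional. The final layout of TPA records
  may differ, and the names TpaRow and parse_tpa_block are ours.

```python
from typing import NamedTuple

class TpaRow(NamedTuple):
    tpa_span: tuple          # (start, end) on the TPA sequence
    primary_identifier: str  # acc.version of the contributing sequence
    primary_span: tuple      # (start, end) on the primary sequence
    complement: bool         # True when the COMP column holds 'c'

def parse_span(text):
    """Turn a span such as '1-426' into a tuple of two integers."""
    start, end = text.split("-")
    return int(start), int(end)

def parse_tpa_block(block):
    """Parse the rows of a TPA-block, skipping the column header line."""
    rows = []
    for line in block.strip().splitlines():
        fields = line.split()
        if not fields or fields[0] == "TPA_SPAN":
            continue
        complement = len(fields) > 3 and fields[3] == "c"
        rows.append(TpaRow(parse_span(fields[0]), fields[1],
                           parse_span(fields[2]), complement))
    return rows

example = """
TPA_SPAN       PRIMARY_IDENTIFIER     PRIMARY_SPAN     COMP
1-426          AC004528.1             18665-19090
427-526        AC001234.2             1-100            c
"""
rows = parse_tpa_block(example)
for row in rows:
    # Each TPA span should cover as many bases as its primary span.
    tpa_len = row.tpa_span[1] - row.tpa_span[0] + 1
    primary_len = row.primary_span[1] - row.primary_span[0] + 1
    print(row.primary_identifier, tpa_len == primary_len, row.complement)
```

  For the two example rows, both spans have matching lengths (426 and 100
  bases), and only the second row is flagged as complementary.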

  Preliminary exchanges of TPA records among DDBJ/EMBL/GenBank are
  underway. Within two months, data products will be made available at
  the GenBank FTP site for TPA sequences. Details about those products,
  sample records, and instructions for submission of TPA data, will
  be communicated via the GenBank newsgroup:

          http://net.bio.net/hypermail/genbankb/


---


- gttaacaattaaagagtgtttatcgaaattcattatatagtggtttatatagaccacttc
-
- GenBank newsgroup see: http://www.bio.net/hypermail/genbankb/       
- GENBANKB e-mail: messages sent to genbankb at net.bio.net
- subscribe: e-mail biosci-server at net.bio.net with: subscribe genbankb
- unsub: e-mail biosci-server at net.bio.net with: unsubscribe genbankb      
- GenBank on the WWW, see:  http://www.ncbi.nlm.nih.gov/Genbank/
- problems with GENBANKB? E-mail moderator: francis at cmmt.ubc.ca                  




