GenBank Release 132.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Wed Nov 6 22:37:42 EST 2002


Greetings GenBank Users,

  GenBank Release 132.0 is now available via FTP from the National Center
for Biotechnology Information (NCBI):

  FTP Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 132.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 132.0

  Uncompressed, the Release 132.0 flatfiles require roughly 83.35 GB
(sequence files only) or 92.75 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires roughly 71.76 GB. From the
release notes:

   Release  Date       Base Pairs   Entries

   131      Aug 2002   22616937182  18197119
   132      Oct 2002   26525934656  19808101

  Close-of-data was 10/29/2002. Seven working days were required to prepare
this release. In the 9.5-week period between close-of-data for GenBank
releases 131.0 and 132.0, GenBank grew by 3.909 billion base pairs and by
1,610,982 sequence records. During that same period, 311,891 records were
updated. Combined, this yields an average of about 28,700 new/updated
records per day.
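
  The growth figures above can be rechecked from the statistics table. The
following is a minimal Python sketch, not part of the release notes. The
function and variable names are illustrative, and the 67-day interval is an
assumption: the text states only "9.5 weeks", which is 66.5 days.

```python
# Totals copied from the release statistics table above.
BASES_131, ENTRIES_131 = 22_616_937_182, 18_197_119
BASES_132, ENTRIES_132 = 26_525_934_656, 19_808_101
UPDATED_RECORDS = 311_891  # records updated during the same period


def growth(old, new):
    """Difference between two release totals."""
    return new - old


def per_day(new_records, updated_records, days):
    """Average number of new plus updated records per day."""
    return (new_records + updated_records) / days


new_bases = growth(BASES_131, BASES_132)
new_entries = growth(ENTRIES_131, ENTRIES_132)

print(f"new base pairs: {new_bases:,}")
print(f"new entries:    {new_entries:,}")
# 66.5 days is 9.5 weeks exactly; 67 days is an assumed whole-day interval.
for days in (66.5, 67):
    rate = per_day(new_entries, UPDATED_RECORDS, days)
    print(f"{days} days: {rate:,.0f} new/updated records per day")
```

  With 66.5 days the average comes out near 28,900 records per day; with 67
days it rounds to the quoted figure of about 28,700.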

  This growth is the largest experienced in GenBank's history. It is true
that growth for Release 132.0 is a bit higher than normal due to the delay
in close-of-data caused by problems implementing the format changes for the
SOURCE and ORGANISM lines. But given that the 3.9 Gbp figure completely
eclipses the previous record of nearly 2 Gbp, Release 132.0 would have been
a landmark even without that delay.

  We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank .
Those who experience slow FTP transfers of large files (entire releases, the
GenBank Cumulative Update, etc.) might realize an improvement in transfer
rates from these alternate sites when traffic at the NCBI is heavy.

  For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 132.0 and Upcoming Changes) have been appended below.

                         * * * IMPORTANT * * *

  As described in the release notes, a database-wide format change has been
implemented for the SOURCE and ORGANISM lines with this release. In addition,
NCBI plans to discontinue support of the GenBank Cumulative Update products.
We therefore strongly urge users to review Sections 1.3.4 and 1.4.1.

  Also, an unplanned supplemental data file (gbsup.seq) is present in
Release 132.0; see Section 1.3.2 for a description of its contents.

  Release 132.0 data, and subsequent updates, are available now via NCBI's
Entrez and Blast services.

  New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 132.0 close-of-data, should be
available by about 10:00am EST, November 7. Please note that the new CUs will be
smaller than previous versions you might have obtained after Release 131.0 was
posted.

  If you encounter problems while ftp'ing or uncompressing Release 132.0,
please send email outlining your difficulties to info at ncbi.nlm.nih.gov .

Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev
GenBank
NCBI/NLM/NIH


1.3 Important Changes in Release 132.0

1.3.1 Organizational changes

  Due to database growth, the BCT division is now being split into 6 pieces.

  Due to database growth, the EST division is now being split into 190 pieces.

  Due to database growth, the GSS division is now being split into 59 pieces.

  Due to database growth, the HTC division has been newly split into two pieces,
  gbhtc1.seq and gbhtc2.seq .

  Due to database growth, the HTG division is now being split into 56 pieces.

  Due to database growth, the PAT division is now being split into 6 pieces.

  Due to database growth, the PRI division is now being split into 22 pieces.

  Due to database growth, the ROD division is now being split into 5 pieces.

1.3.2 Full-Length-Insert cDNA Problem; Supplemental Release File

  Due to a change in the way sequences associated with full-length-insert cDNA
sequencing projects are handled at the NCBI, 19,345 records were inadvertently 
excluded from the normal divisional GenBank sequence data files. Given the delays
caused by the database-wide SOURCE/ORGANISM format change (see Section 1.3.4),
we have elected to make these records available via a supplemental data file,
since the alternatives would have resulted in further delays:

	gbsup.seq

  The records that this 'SUP' data file contains will be distributed among the
normal divisions (INV, PLN, etc) for GenBank Release 133.0 . Our apologies for
any inconvenience that the presence of this additional file might cause.

1.3.3 Minor REFERENCE format change

  A small number of records in GenBank have more than 99 literature references
and/or database submission references. This leads to a formatting problem for
the REFERENCE line:

LOCUS       SV4CG                   5243 bp    DNA     circular VRL 14-DEC-2000
DEFINITION  Simian virus 40 complete genome.
ACCESSION   J02400 J02402 J02403 J02406 J02407 J02408 J02409 J02410 J04139
            M24874 M24914 M28728 V01380
VERSION     J02400.1  GI:965480
....
REFERENCE   100(sites)
  AUTHORS   Mertz,J.E., Murphy,A. and Barkan,A.
  TITLE     Mutants deleted in the agnogene of simian virus 40 define a new
            complementation group
  JOURNAL   J. Virol. 45 (1), 36-46 (1983)
  MEDLINE   83112203
   PUBMED   6296443
REFERENCE   101(bases 1 to 129; 5228 to 5243)
  AUTHORS   Byrne,B.J., Davis,M.S., Yamaguchi,J., Bergsma,D.J. and
            Subramanian,K.N.
  TITLE     Definition of the simian virus 40 early promoter region and
            demonstration of a host range bias in the enhancement effect of the
            simian virus 40 72-base-pair repeat
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 80 (3), 721-725 (1983)
  MEDLINE   83144002
   PUBMED   6298771

  Note the lack of a space between the REFERENCE number and the basepair span
(or 'sites' designation).

  Since the REFERENCE format is not column-specific beyond column 13 (see
Section 3.4.11), we have addressed this problem by simply adding a space after
any REFERENCE number greater than 99. Those who might parse base range (or sites)
information from REFERENCE lines based on a column position should modify their
software to tokenize on whitespace characters instead.
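
  As an illustration of parsing that does not depend on column positions, here
is a minimal Python sketch. It is not NCBI code: the function name and the
regular expression are assumptions made for this example. It accepts any amount
of whitespace (including none, as in the problem records above) between the
REFERENCE number and the parenthesized span.

```python
import re

# REFERENCE keyword, whitespace, the reference number, optional whitespace,
# then an optional parenthesized span such as "(sites)" or
# "(bases 1 to 129; 5228 to 5243)".
REFERENCE_RE = re.compile(r"^REFERENCE\s+(\d+)\s*(?:\((.*)\))?\s*$")


def parse_reference_line(line):
    """Return (reference_number, span_text_or_None) for a REFERENCE line."""
    match = REFERENCE_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a REFERENCE line: {line!r}")
    number, span = match.groups()
    return int(number), span


# Old layout (no space) and new layout (space added) parse identically.
print(parse_reference_line("REFERENCE   100(sites)"))
print(parse_reference_line("REFERENCE   100 (sites)"))
```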

1.3.4 Change to the SOURCE and ORGANISM format

Starting with this October 2002 GenBank Release 132.0, a new, more flexible
SOURCE format has been adopted. It allows for the display of several types of
secondary names (common names, acronyms, synonyms, anamorphs for the fungi),
which can be derived either from the taxonomy database *or* from the source
feature annotation provided by the submitter.

In addition, the optional organelle prefix has moved from the ORGANISM line 
(in the old format) to the SOURCE line in the new format. The ORGANISM line
now contains only the unadorned organism name, the name by which a sequence
entry is indexed in the taxonomy database.

   --- Previous GenBank format ---

SOURCE    [organism name] OR [common name]
ORGANISM  [organelle prefix] organism name

   --- New GenBank format ---

SOURCE    [organelle prefix] organism name ([optional second name])
ORGANISM  organism name

The optional second name can be one of the following (ordered by precedence):

  'synonym' from the source feature organism modifiers (submitter-supplied)
  'acronym' from the source feature organism modifiers (submitter-supplied)
  'anamorph' from the source feature organism modifiers (submitter-supplied)
  'common' from the source feature organism modifiers (submitter-supplied)

  'genbank synonym' from the taxonomy database
  'genbank acronym' from the taxonomy database
  'genbank anamorph' from the taxonomy database
  'genbank common name' from the taxonomy database

The first set allows us to customize the flatfiles of particular entries;
the second allows us to add useful information from the taxonomy database,
with a more reasonable presentation than the previous format allowed.

The 'anamorph' names appear within parentheses, prefixed with "anamorph:".
The 'common name', 'acronym' and 'synonym' fields are parenthesized without
a prefix (see examples below).
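
The precedence and parenthesization rules above can be sketched in Python as
follows. This is an illustrative sketch only, not the software used to build
the release. The function names and the dictionary keyed by name type are
assumptions, and it assumes at most one value per name type.

```python
# Name types in the precedence order listed above: submitter-supplied
# source feature modifiers first, then names from the taxonomy database.
PRECEDENCE = [
    "synonym", "acronym", "anamorph", "common",
    "genbank synonym", "genbank acronym", "genbank anamorph",
    "genbank common name",
]


def second_name(names):
    """Return the parenthesized secondary name with the highest
    precedence, or None if the dict of name type -> value has none."""
    for name_type in PRECEDENCE:
        value = names.get(name_type)
        if value:
            # Only anamorph names carry a prefix inside the parentheses.
            if name_type.endswith("anamorph"):
                return f"(anamorph: {value})"
            return f"({value})"
    return None


def source_line_value(organism, names, organelle_prefix=None):
    """Build the text that follows the SOURCE keyword."""
    parts = [organelle_prefix, organism, second_name(names)]
    return " ".join(part for part in parts if part)


print(source_line_value("Sus scrofa", {"genbank common name": "pig"}))
```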

The SOURCE line organelle prefix corresponds to the most detailed portion
of the string value for the /organelle qualifier of the source feature. This
allows us to annotate everything with the correct general terms, yet prominently
display the familiar 'Chloroplast' & 'Kinetoplast' :

  organelle qualifier           SOURCE organelle prefix
  -------------------           -----------------------
  "plastid"                     plastid
  "mitochondrion"               mitochondrion
  "nucleomorph"                 nucleomorph
  "mitochondrion: kinetoplast"  kinetoplast
  "plastid: chloroplast"        chloroplast
  "plastid: apicoplast"         apicoplast
  "plastid: chromoplast"        chromoplast
  "plastid: cyanelle"           cyanelle
  "plastid: leucoplast"         leucoplast
  "plastid: protoplast"         protoplast

=======================================================
======          Examples of the new format       ======
=======================================================

In all of the examples below, the source feature qualifiers given in the first
part of the example automatically generate the SOURCE & ORGANISM lines shown:

------------------------

  /organism="Sus scrofa"

SOURCE      Sus scrofa (pig)
ORGANISM    Sus scrofa

'pig' is the genbank common name from the GenBank taxonomy database.

------------------------

  /organism="Sus scrofa"
  /note="common: Japanese wild boar"

SOURCE      Sus scrofa (Japanese wild boar)
ORGANISM    Sus scrofa

The submitter-supplied common name from the source feature overrides
the common name from the GenBank taxonomy database in the new SOURCE
format.

------------------------

  /organism="Takifugu rubripes"

SOURCE       Takifugu rubripes (Fugu rubripes)
ORGANISM     Takifugu rubripes

'genbank synonym' from the taxonomy database is displayed on the SOURCE
line.

------------------------

  /organism="Takifugu rubripes"
  /note="common: Sydney's pufferfish"

SOURCE       Takifugu rubripes (Sydney's pufferfish)
ORGANISM     Takifugu rubripes

Any of the customizing fields from the entry itself take precedence
over the default values from the taxonomy database.

------------------------

  /organism="Cauliflower mosaic virus"

SOURCE       Cauliflower mosaic virus (CaMV)
ORGANISM     Cauliflower mosaic virus

If there is a single acronym listed in the taxonomy database,
it will appear on the SOURCE line.

------------------------

'genbank anamorph' (from the taxonomy database) 

  /organism="Emericella nidulans"

SOURCE       Emericella nidulans (anamorph: Aspergillus nidulans)
ORGANISM     Emericella nidulans

The 'anamorph' nametype is prefixed with "anamorph:" on the SOURCE line
to distinguish it from a taxonomic synonym.

------------------------

  /organism="Mytilus californicus"
  /organelle="mitochondrion"

SOURCE       mitochondrion Mytilus californicus (California mussel)
ORGANISM     Mytilus californicus

Organelle prefix moved to SOURCE; common name from the GenBank taxonomy
database has been added to SOURCE .
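
The organelle prefix rule described earlier in this section (the prefix is the
most detailed portion of the /organelle qualifier value) can be sketched in
Python as follows. The function name is hypothetical, and the sketch assumes
that the most detailed portion is always the text after the last colon.

```python
def organelle_prefix(organelle_qualifier):
    """Return the most detailed portion of an /organelle value:
    the text after the last colon, or the whole value if there is none."""
    return organelle_qualifier.rsplit(":", 1)[-1].strip()


for value in ("mitochondrion", "mitochondrion: kinetoplast",
              "plastid: chloroplast"):
    print(f"{value!r} -> {organelle_prefix(value)!r}")
```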

1.3.5 GSS File Header Problem

  GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.

  There is thus a discrepancy between the filenames and file headers of nine
GSS flatfiles in Release 132.0. Consider the gbgss51.seq file:

GBGSS1.SEQ           Genetic Sequence Data Bank
                          October 15 2002

                NCBI-GenBank Flat File Release 132.0

                           GSS Sequences (Part 1)

   88406 loci,    66834201 bases, from    88406 reported sequences

  Here, the filename and part number in the header both carry the number "1",
though the file has been renamed with the number "51" based on the files
dumped from the other system.

  We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
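
  Users who rely on the file header can detect affected files by comparing the
header with the actual filename. The following Python sketch is illustrative
only. The function names are hypothetical, and it assumes that the first line
of each file header begins with the upper-cased filename, as in the example
above.

```python
import os


def header_filename(header_line):
    """First whitespace-delimited token of the header's first line,
    lowercased (e.g. 'GBGSS1.SEQ   Genetic ...' -> 'gbgss1.seq')."""
    tokens = header_line.split()
    return tokens[0].lower() if tokens else ""


def header_mismatch(path, header_line):
    """True if the name recorded in the header differs from the
    actual name of the file at `path`."""
    return os.path.basename(path).lower() != header_filename(header_line)


first_line = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_mismatch("gbgss51.seq", first_line))  # the case shown above
```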

1.4 Upcoming Changes

1.4.1 * * Cumulative GenBank Update Products To Be Discontinued * *

  As of GenBank Release 134.0 in February of 2003, the cumulative
GenBank Update (GBCU) products will be discontinued:

	ftp://ftp.ncbi.nih.gov/ncbi-asn1/daily/gbcu.aso.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.flat.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.fsa_nt.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.gnp.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gbcu.qscore.gz
	ftp://ftp.ncbi.nih.gov/genbank/daily/gpcu.fsa.gz

  In the eight weeks between typical GenBank Releases, it is not uncommon
for GBCU products to approach 20% of the total database size. The flatfile
version, for example, has reached sizes in excess of 15 GB in recent weeks.

  From a user perspective, repeatedly obtaining and processing such a
large update product makes inefficient use of both bandwidth and local
resources, compared to the much smaller incremental GbUpdate products.

  And in order to reliably generate the GBCU in the face of such explosive
growth, NCBI would have to invest significant resources to increase the 
performance of a large body of software.

  Given these factors, plus the questionable value of an "update" product,
generated daily, which will soon approach 20 GB in size, we have decided
that the GBCU should be discontinued. We will analyze FTP logs and 
proactively contact the larger centers which utilize the GBCU, to suggest
alternate processing strategies.

  If large numbers of users are unable to switch to processing incremental
updates by February 2003, there is a possibility that the date for
discontinuing the GBCU might be pushed back to April.

  We will keep users informed of the timetable for this important change
via these release notes and the GenBank newsgroup. And of course, we 
welcome discussions of this change via the newsgroup.

1.4.2 New SET Data File For The ASN.1 Representation

  Some phylogenetic and mutational studies involve sequences from more
than one of the 'taxonomic' divisions of GenBank. For example, a phylogenetic
study might involve sequences obtained from human (PRI) and non-primate
mammalian (MAM) sources.

  Such studies, often with associated sequence alignments, are maintained
and edited as a single unit in the underlying data representation (ASN.1)
utilized by the NCBI.

  When generating GenBank flatfiles from such studies, the component
sequences are processed in such a way that they are individually directed
to an appropriate divisional file. For example, the human sequences to a
PRI division file, the other mammalian sequences to the MAM division file.

  In the past, we have mimicked this behavior for the ASN.1 version of
GenBank releases by splitting the studies into their components, and
splicing them into an appropriate divisional ASN.1 file (e.g., gbpri1.aso
and gbmam.aso) .

  This practice has clear disadvantages: the components of the studies
really should *not* be separated; and the post-processing of these special
studies adds considerable overhead to release processing.

  Starting with GenBank Release 133.0 in December of 2002, we will cease
this practice and introduce a new ASN.1 data file for such multi-divisional
studies at ftp://ftp.ncbi.nih.gov/ncbi-asn1 . The new file will be :

	gbset.aso

1.4.3 Reduction In The Number Of ASN.1 Data Files

  The sizes of many of the files for the ASN.1 version of GenBank releases
(see ftp://ftp.ncbi.nih.gov/ncbi-asn1 ) are well below the 250 MB limit used
for the GenBank flatfile version. For example, the PRI division ASN.1 files
are about 140 MB apiece, and the HTG division files are about 190 MB apiece.

  This is due to the fact that the ASN.1 representation was originally
used to create the flatfile version, on a file-by-file basis, during release
generation. Since the ASN.1 version is more compact than the flatfile version,
the ASN.1 file sizes had to be less than 250 MB to yield 250 MB flatfiles.

  Now that the ASN.1 and flatfile versions are created independently, the
sizes of the ASN.1 files can be increased without consequences for the
flatfiles.

  Starting with the December 2002 Release 133.0, the file size limit for all
ASN.1 files will be increased to 250 MB, and as a result, the total number of
ASN.1 files will be significantly reduced.
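
  A rough estimate of the reduction can be sketched in Python. This is a
back-of-the-envelope illustration, not an NCBI projection. It assumes that
each division currently has as many ASN.1 files as flatfile pieces (22 for
PRI, 56 for HTG), that the approximate per-file sizes quoted above apply to
every file, and that the new files will be filled to the 250 MB limit. The
function name is hypothetical.

```python
import math


def estimated_file_count(current_files, current_size_mb, new_limit_mb=250):
    """Estimate how many files of at most new_limit_mb are needed to
    hold current_files files of roughly current_size_mb each."""
    total_mb = current_files * current_size_mb
    return math.ceil(total_mb / new_limit_mb)


# Assumed inputs: file counts from Section 1.3.1, sizes from this section.
print("PRI:", estimated_file_count(22, 140))
print("HTG:", estimated_file_count(56, 190))
```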
