GenBank Release 139.0 Now Available
cavanaug at ncbi.nlm.nih.gov
Wed Dec 24 11:27:33 EST 2003
Greetings GenBank Users,
GenBank Release 139.0 is now available via ftp from the National
Center for Biotechnology Information (NCBI):
Ftp Site           Directory   Contents
----------------   ---------   ---------------------------------------
ftp.ncbi.nih.gov   genbank     GenBank Release 139.0 flatfiles
                   ncbi-asn1   ASN.1 data used to create Release 139.0
Uncompressed, the Release 139.0 flatfiles require approximately 122 GB
(sequence files only) or 138 GB (including the 'short directory' and
'index' files). The ASN.1 version requires approximately 108 GB. From
the release notes:
Release   Date       Base Pairs    Entries
138       Oct 2003   35599621471   29819397
139       Dec 2003   36553368485   30968418
Close-of-data was 10/19/2003. Four days were required to prepare this
release. In the eight week period between the close dates for GenBank
releases 138.0 and 139.0, GenBank grew by 953,747,014 basepairs and by
1,149,021 sequence records. During that same period, 56,198 records
were updated. Combined, this yields an average of about 21,500
new/updated records per day.
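The growth figures quoted above can be reproduced from the two table rows. The following is only an illustrative sketch in Python: the variable names are invented for this example, and the 56-day span is an assumption taken from the phrase "eight week period".

```python
# Totals copied from the release-notes table above.
REL_138 = {"base_pairs": 35_599_621_471, "entries": 29_819_397}
REL_139 = {"base_pairs": 36_553_368_485, "entries": 30_968_418}

UPDATED_RECORDS = 56_198  # records updated between the two close dates
DAYS_BETWEEN = 8 * 7      # assumption: "eight week period" taken as 56 days

# Growth between the two releases.
new_base_pairs = REL_139["base_pairs"] - REL_138["base_pairs"]
new_records = REL_139["entries"] - REL_138["entries"]

# Average number of new or updated records per day.
per_day = (new_records + UPDATED_RECORDS) / DAYS_BETWEEN

print(f"{new_base_pairs:,} bp, {new_records:,} records, {per_day:,.0f} per day")
```

The exact average is a little over 21,500, which matches the rounded figure given in the announcement.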
We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank.
Those who experience slow FTP transfers of large files might realize an
improvement in transfer rates from these alternate sites when traffic at
the NCBI is heavy.
For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 139.0 and Upcoming Changes) have been appended
below.
*NOTE* Section 1.4.1 discusses a very important change: the removal
of sequence length limits for all classes of GenBank sequence records,
as of June 2004. We strongly encourage all users to review this
section.
Release 139.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.
If you encounter problems while ftp'ing or uncompressing Release
139.0, please send email outlining your difficulties to
info at ncbi.nlm.nih.gov .
Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev, Michael Kimelman
1.3 Important Changes in Release 139.0
1.3.1 Organizational changes
The total number of sequence data files increased by 8 with this release:
- the EST division is now comprised of 288 files (+9)
- the PAT division is now comprised of 11 files (+1)
- the PLN division is now comprised of 10 files (+1)
- the PRI division is now comprised of 27 files (+1)
- the ROD division is now comprised of 11 files (+1)
- the HTG division is now comprised of 61 files (-1)
- the GSS division is now comprised of 98 files (-4)
Updates to a significant number of GSS sequences have resulted in a
*decrease* in the overall number of GSS sequence files, from 102 to 98.
The decrease in the number of HTG sequence files reflects the
on-going finishing of DRAFT-quality sequences.
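As a quick consistency check, the per-division changes listed above can be summed to confirm the net increase of 8 files. This is a minimal Python sketch; the dictionary and variable names are made up for illustration.

```python
# Per-division change in the number of sequence files, from the list above.
file_count_changes = {
    "EST": 9,
    "PAT": 1,
    "PLN": 1,
    "PRI": 1,
    "ROD": 1,
    "HTG": -1,
    "GSS": -4,
}

# Net change across all divisions.
net_change = sum(file_count_changes.values())

# GSS now has 98 files, so the previous count follows from the change.
gss_files_before = 98 - file_count_changes["GSS"]

print(net_change, gss_files_before)
```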
1.3.2 WGS Accessions and Finished WGS Projects
Whole Genome Shotgun (WGS) sequences utilize an accession number format which
is different from those used for non-WGS GenBank sequences. This format is
referred to as 4 + 2 + 6, and is comprised of:
- a 4-letter WGS project code
- a 2-digit assembly-version number
- a 6 (and sometimes 7) digit sequence number
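The 4 + 2 + 6 layout described above lends itself to a simple pattern match. The sketch below is one possible way to split such an accession in Python. The names `WgsAccession` and `parse_wgs_accession` are hypothetical, and the pattern assumes upper-case letters for the project code.

```python
import re
from typing import NamedTuple, Optional

# 4 letters, 2 digits, then 6 (sometimes 7) digits, as described above.
WGS_ACCESSION = re.compile(r"([A-Z]{4})(\d{2})(\d{6,7})")


class WgsAccession(NamedTuple):
    project: str            # 4-letter WGS project code
    assembly_version: int   # 2-digit assembly-version number
    sequence_number: int    # 6- or 7-digit sequence number


def parse_wgs_accession(accession: str) -> Optional[WgsAccession]:
    """Return the parts of a WGS accession, or None if the format differs."""
    match = WGS_ACCESSION.fullmatch(accession.strip().upper())
    if match is None:
        return None
    project, version, number = match.groups()
    return WgsAccession(project, int(version), int(number))


print(parse_wgs_accession("AACL01000001"))
print(parse_wgs_accession("AE017199"))  # not in WGS format
```

Using `fullmatch` rather than `search` keeps ordinary accessions such as AE017199 from being misread as WGS accessions.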
Because of their unique nature, WGS sequences are kept separate from other
GenBank sequences.
For example, sequences *much* larger than the current 350 kbp limit can be
generated during the WGS assembly phase. In addition, there is no tracking
of nucleotide sequences from one assembly to the next, so the accessions
of one assembly are not necessarily related in any obvious way to those of
the next.
When a WGS project is completed, the submitters may choose to submit a
single finished WGS sequence with a 4 + 2 + 6 accession, at which point it
could appear in the non-WGS portion of GenBank.
Alternatively, the submitters might choose to submit the completed genome via
a non-WGS method, in which case a de novo non-WGS accession would be assigned.
That record would then have one or more 4 + 2 + 6 WGS accessions as
secondary accessions.
The second of these scenarios has recently occurred for one bacterial genome:
LOCUS       AE017199   490885 bp   DNA   circular   BCT   18-DEC-2003
DEFINITION  Nanoarchaeum equitans Kin4-M, complete genome.
ACCESSION   AE017199 AACL01000001 AACL01000000
VERSION     AE017199.1  GI:40068520
Both scenarios will become increasingly frequent, especially after the 350 kbp
sequence length restriction is lifted (see Section 1.4.1). So users should be
aware that WGS accessions can be encountered in the non-WGS portion of GenBank,
as primary *or* secondary accession numbers.
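Software that reads ACCESSION lines may therefore want to flag WGS-style accessions wherever they appear. The following Python sketch shows one way to do this. The function name `classify_accessions` is hypothetical, and the code assumes a line of space-separated individual accessions; accession ranges and continuation lines are not handled.

```python
import re
from typing import Dict, List, Optional, Union

# 4 letters + 2 digits + 6 or 7 digits, i.e. 8 or 9 digits in total.
WGS_PATTERN = re.compile(r"[A-Z]{4}\d{8,9}")

Result = Dict[str, Union[Optional[str], List[str]]]


def classify_accessions(accession_line: str) -> Result:
    """Split an ACCESSION line into primary, secondary, and WGS-style entries."""
    fields = accession_line.split()
    if fields and fields[0] == "ACCESSION":
        fields = fields[1:]  # drop the line keyword
    if not fields:
        return {"primary": None, "secondary": [], "wgs": []}
    return {
        "primary": fields[0],      # first accession is the primary one
        "secondary": fields[1:],   # any others are secondary
        "wgs": [acc for acc in fields if WGS_PATTERN.fullmatch(acc)],
    }


result = classify_accessions("ACCESSION AE017199 AACL01000001 AACL01000000")
print(result)
```

For the Nanoarchaeum equitans record above, the primary accession is non-WGS and both secondary accessions are reported as WGS-style.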
1.3.3 GSS File Header Problem
GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.
There is thus a discrepancy between the filenames and file headers of twelve
GSS flatfiles in Release 139.0. Consider the gbgss87.seq file:
GBGSS1.SEQ          Genetic Sequence Data Bank
                         December 15 2003
               NCBI-GenBank Flat File Release 139.0
                       GSS Sequences (Part 1)
   87880 loci, 66457089 bases, from 87880 reported sequences
Here, the filename and part number in the header are both "1", though the
file has been renamed as "87" based on the files dumped from the other system.
We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
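Users whose scripts rely on the header can detect affected files by comparing the part number in the filename with the one on the first header line. This is a rough Python sketch, with hypothetical function names, that assumes GSS filenames of the form gbgssN.seq and a header line beginning with GBGSSN.SEQ as in the example above.

```python
import re
from typing import Optional


def part_from_filename(filename: str) -> Optional[int]:
    """Extract N from a name such as 'gbgss87.seq'."""
    match = re.fullmatch(r"gbgss(\d+)\.seq", filename.lower())
    return int(match.group(1)) if match else None


def part_from_header(first_line: str) -> Optional[int]:
    """Extract N from a header line starting with 'GBGSSN.SEQ'."""
    match = re.match(r"GBGSS(\d+)\.SEQ\b", first_line.strip().upper())
    return int(match.group(1)) if match else None


def header_matches_filename(filename: str, first_line: str) -> bool:
    """True only when both part numbers were found and are equal."""
    file_part = part_from_filename(filename)
    header_part = part_from_header(first_line)
    return file_part is not None and file_part == header_part


header = "GBGSS1.SEQ Genetic Sequence Data Bank"
print(header_matches_filename("gbgss87.seq", header))  # mismatch, as in 139.0
```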
1.4 Upcoming Changes
1.4.1 **Sequence Length Limitation To Be Removed In June 2004**
At the May 2003 collaborative meeting among representatives of GenBank,
EMBL, and DDBJ, it was decided that the 350 kilobase limit on the sequence
length of database records will be removed as of June 2004.
Individual, complete sequences are currently expected to be a maximum
of 350 kbp in length. One major reason for the existence of this limit is
as an aid to users of sequence analysis software, some of which might not
be capable of processing megabase-scale sequences.
However, very significant exceptions to the 350 kbp limit have existed
for several years: Phase 1 (unordered, unoriented) and Phase 2 (ordered,
oriented) high-throughput genomic sequences (HTGS) generated by efforts
such as the Human Genome Project; large dispersed eukaryotic genes with
an intron/exon structure that spans more than 350 kbp; and sequences
which result from assemblies of Whole Genome Shotgun (WGS) project data.
Given these exceptions, and the technological advances which have made
large-scale sequencing practical for an increasing number of researchers,
the collaboration has decided that the 350 kbp limit must be removed.
As of June 2004, the length of database sequences will be limited only
by the natural structures of an organism's genome. For example, a single
record might be used to represent all of human chromosome 1, which is
approximately 245 Mbp in length.
Software developers for some of the larger commercial sequence analysis
packages were recently asked what timeframe would be appropriate for this
change. Answers ranged from "immediately", to "several months", to "one year".
So the one-year timeframe was selected, to provide ample time to implement
changes which megabase-scale sequences may require.
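One common adjustment for megabase-scale records is to process sequence data line by line instead of loading a whole record into memory. The Python sketch below illustrates the idea for the ORIGIN block of a flatfile record. It is not taken from any particular package; the function names are hypothetical and the tiny sample record is invented for the example.

```python
import io
from typing import Iterable, Iterator


def iter_sequence_chunks(lines: Iterable[str]) -> Iterator[str]:
    """Yield the bases on each ORIGIN line, one line at a time."""
    in_origin = False
    for line in lines:
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False  # end of record
        elif in_origin:
            # First field is the position number; the rest are blocks of bases.
            yield "".join(line.split()[1:])


def count_bases(lines: Iterable[str]) -> int:
    """Count bases without holding the full sequence in memory."""
    return sum(len(chunk) for chunk in iter_sequence_chunks(lines))


# Invented three-line sequence, for illustration only.
sample = io.StringIO(
    "LOCUS       EXAMPLE\n"
    "ORIGIN\n"
    "        1 gatcctccat atacaacggt\n"
    "       21 atctccacct\n"
    "//\n"
)
print(count_bases(sample))
```

Because the generator consumes any iterable of lines, the same functions work on an open file handle for a record of any length.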
Some sample records with very large sequences have been made available
so that developers can begin to test their software modifications:
Many changes are expected after the removal of the length limit. For
example, complete bacterial genomes (typically on the order of several
megabases) will be re-assembled into single sequence records. The submission
process for such genomes will become much more streamlined, since database
staff will no longer have to split the genomes into pieces. BLAST services
will be enhanced, so that hits reported within very large sequences will
be presented in a meaningful context.
All such changes will be discussed more fully in future release notes,
the NCBI newsletter, and the GenBank newsgroup.
1.4.2 Filename change : genpept.fsa
A companion file is made available with every GenBank Release that contains
all of the protein sequences, in FASTA format, for the coding regions annotated
on GenBank records:
The term 'GenPept' has been used for purely historical reasons. But in
fact, GenPept is the name of a (non-FASTA) flatfile format for protein sequences,
one that closely parallels the GenBank flatfile format for DNA sequences.
As of Release 140.0 in February 2004, the name of this file will be
where 'NNN' represents the GenBank release number (e.g., '140') and 'aa_fsa'
signifies 'protein FASTA'.
- GenBank newsgroup see: http://www.bio.net/hypermail/genbankb/
- GENBANKB e-mail: messages sent to genbankb at net.bio.net
- subscribe: e-mail biosci-server at net.bio.net with: subscribe genbankb
- unsub: e-mail biosci-server at net.bio.net with: unsubscribe genbankb
- GenBank on the WWW, see: http://www.ncbi.nlm.nih.gov/Genbank/
- problems with GENBANKB? E-mail moderator: francis at cmmt.ubc.ca