GenBank Release 142.0 Now Available
cavanaug at ncbi.nlm.nih.gov
Fri Jun 25 00:08:38 EST 2004
Greetings GenBank Users,
GenBank Release 142.0 is now available via ftp from the National
Center for Biotechnology Information (NCBI):
Ftp Site          Directory  Contents
----------------  ---------  ---------------------------------------
ftp.ncbi.nih.gov  genbank    GenBank Release 142.0 flatfiles
                  ncbi-asn1  ASN.1 data used to create Release 142.0
Uncompressed, the Release 142.0 flatfiles require approximately 136 GB
(sequence files only) or 153 GB (including the 'short directory' and
'index' files). The ASN.1 version requires approximately 119 GB. From
the release notes:
Release  Date      Base Pairs   Entries
141      Apr 2004  38989342565  33676218
142      Jun 2004  40325321348  35532003
Close-of-data was 06/18/2004. Six business days were required to prepare
this release. In the eight-week period between the close dates for GenBank
releases 141.0 and 142.0, the non-WGS portion of GenBank grew by 1,335,978,783
basepairs and by 1,855,785 sequence records. During that same period, 345,286
records were updated. Combined, this yields an average of about 37,300 new
and/or updated records per day.
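
The figures above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are ours, and the number of days NCBI used for the average is not stated in the announcement, so the script derives the day count that the reported average implies rather than assuming one.

```python
# Figures quoted in the announcement (the table totals and the update count).
base_pairs_141 = 38_989_342_565
base_pairs_142 = 40_325_321_348
entries_141 = 33_676_218
entries_142 = 35_532_003
updated_records = 345_286

# Growth between the two releases.
new_base_pairs = base_pairs_142 - base_pairs_141
new_records = entries_142 - entries_141
new_or_updated = new_records + updated_records

# The announcement reports "about 37,300" new and/or updated records per day.
# Dividing by that figure gives the number of days the average implies.
reported_daily_average = 37_300
implied_days = new_or_updated / reported_daily_average

print(new_base_pairs, new_records, new_or_updated, round(implied_days, 1))
```

The two differences reproduce the growth figures quoted in the text (1,335,978,783 basepairs and 1,855,785 records), and the reported average corresponds to a period of roughly 59 days.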
Between releases 141.0 and 142.0, the WGS component of GenBank grew by
834,202,151 basepairs and by 241,358 sequence records.
*NOTE* Problems were encountered during release processing which
prevented the generation of the gbjou.idx 'index' file for GenBank 142.0.
Please see Section 1.3.1 of the release notes for further details.
We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank.
Those who experience slow FTP transfers due to a high volume of traffic at
NCBI might realize an improvement in transfer rates from these alternate sites.
For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 142.0 and Upcoming Changes) have been appended below.
*NOTE* Section 1.3.3 discusses a very important change : the removal
of sequence length limits for all classes of GenBank sequence records,
as of this June 2004 release. We strongly encourage all users to review this
section.
Release 142.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.
If you encounter problems while ftp'ing or uncompressing Release
142.0, please send email outlining your difficulties to
info at ncbi.nlm.nih.gov .
Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman
1.3 Important Changes in Release 142.0
1.3.1 One index file is unavailable for GenBank 142.0
A software problem prevented the generation of one of the 'index'
files which normally accompany GenBank releases: gbjou.idx
Our apologies for any inconvenience that this might cause.
1.3.2 Organizational changes
The total number of sequence data files increased by 26 with this release:
- the EST division is now comprised of 321 files (+16)
- the GSS division is now comprised of 110 files (+6)
- the PLN division is now comprised of 12 files (+1)
- the PRI division is now comprised of 28 files (+1)
- the ROD division is now comprised of 13 files (+1)
- the STS division is now comprised of 5 files (+1)
1.3.3 **Sequence Length Limitation Removed As Of June 2004**
At the May 2003 collaborative meeting among representatives of GenBank,
EMBL, and DDBJ, it was decided that the 350 kilobase limit on the sequence
length of database records should be removed as of June 2004.
Previously, individual, complete sequences were expected to be a maximum
of 350 kbp in length. One major reason for the existence of the limit was
as an aid to users of sequence analysis software, some of which might not
be capable of processing megabase-scale sequences.
However, very significant exceptions to the 350 kbp limit have existed
for several years: Phase 1 (unordered, unoriented) and Phase 2 (ordered,
oriented) high-throughput genomic sequences (HTGS) generated by efforts
such as the Human Genome Project; large dispersed eukaryotic genes with
an intron/exon structure that spans more than 350 kbp; and sequences
which result from assemblies of Whole Genome Shotgun (WGS) project data.
Given these exceptions, and the technological advances which have made
large-scale sequencing practical for an increasing number of researchers,
the collaboration decided that the 350 kbp limit must be removed.
Software developers for some of the larger commercial sequence analysis
packages were asked in May 2003 what timeframe would be appropriate for this
change. Answers ranged from "immediately", to "several months", to "one year".
So the one-year timeframe was selected, to provide ample time to implement
changes which megabase-scale sequences might require.
As of GenBank 142.0, the length of database sequences will be limited only
by the natural structures of an organism's genome. For example, it is possible
that a single record might be used to represent all of human chromosome 1,
which is approximately 245 Mbp in length.
Some sample records with very large sequences have been made available
so that developers can test their software with them:
Several changes are expected after the removal of the length limit. For
example, complete bacterial genomes (typically on the order of several
megabases) will be re-assembled into single sequence records. And the submission
process for such genomes will become much more streamlined, since database
staff will no longer have to split the genomes into pieces.
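
Developers who want to find records that exceed the former limit could start from something like the sketch below. It is a minimal illustration, not an NCBI tool: the function names are hypothetical, and it assumes a LOCUS line whose whitespace-separated fields include the sequence length immediately before the token 'bp'. The sample line is the CON-division LOCUS line quoted in Section 1.4.2.

```python
OLD_LIMIT_BP = 350_000  # the former 350 kbp limit described above


def locus_length(locus_line: str) -> int:
    """Return the sequence length stated on a GenBank LOCUS line.

    Assumption: the fields are whitespace-separated and the length is the
    field immediately before the literal token 'bp'.
    """
    fields = locus_line.split()
    if not fields or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    bp_index = fields.index("bp")  # raises ValueError if 'bp' is absent
    return int(fields[bp_index - 1])


def exceeds_old_limit(locus_line: str) -> bool:
    """True if the record is longer than the former 350 kbp limit."""
    return locus_length(locus_line) > OLD_LIMIT_BP


sample = "LOCUS AE016959 23508449 bp DNA linear CON 12-JUN-2003"
print(locus_length(sample), exceeds_old_limit(sample))
```

Software that previously sized buffers or arrays around the 350 kbp figure may need a check of this kind before loading a record, or may need to process sequence data in a streaming fashion.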
1.3.4 Rename of File 'Last.Release' and Deletion of /daily Subdirectories
The files named Last.Release, which used to be located at:
contain the number of the GenBank release which is currently installed
on the NCBI FTP site. As of Release 142.0 in June 2004, these files have
been moved and renamed as:
The /daily subdirectories, which had been used for cumulative update
products that are no longer supported, have been deleted.
1.3.5 GSS File Header Problem
GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.
There is thus a discrepancy between the filenames and file headers for
eighteen GSS flatfiles in Release 142.0. Consider the gbgss93.seq file:
GBGSS1.SEQ Genetic Sequence Data Bank
June 15 2004
NCBI-GenBank Flat File Release 142.0
GSS Sequences (Part 1)
88249 loci, 65635082 bases, from 88249 reported sequences
Here, the header gives the filename as GBGSS1.SEQ and the part number as "1",
though the file has been renamed as "93" based on the files dumped from the
other system.
We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
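
Until the discrepancy is resolved, software that relies on the header can detect affected files with a simple comparison. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it assumes that the first whitespace-separated token of a flatfile's first header line is the filename, as in the gbgss93.seq example above.

```python
import os


def header_matches_filename(path: str, first_header_line: str) -> bool:
    """Compare a flatfile's name with the name given in its header.

    Assumption: the first token of the first header line is the filename
    (e.g. 'GBGSS1.SEQ'). The comparison ignores case.
    """
    tokens = first_header_line.split()
    if not tokens:
        raise ValueError("empty header line")
    return tokens[0].lower() == os.path.basename(path).lower()


# The case described above: the file is named gbgss93.seq on disk,
# but its header still says GBGSS1.SEQ.
print(header_matches_filename("gbgss93.seq",
                              "GBGSS1.SEQ Genetic Sequence Data Bank"))
```

For the affected GSS files this check returns False, so the filename, not the header, should be treated as authoritative when ordering or identifying the files.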
1.4 Upcoming Changes
1.4.1 New qualifier : /old_locus_tag
The /locus_tag qualifier was introduced in April 2003 to provide
a method for systematically identifying genes, coding regions and
other features which typically result from computational analysis.
This qualifier is typically used instead of /gene .
Sometimes the /locus_tag identifier series supplied by a submitter
of sequence data undergoes a change. Because the original /locus_tag
identifiers might be referenced in journal articles, or in databases,
a method of preserving the original identifiers is needed.
A new qualifier, /old_locus_tag , will be introduced as of October
2004 for this purpose. A formal description of the qualifier will be
made available via upcoming GenBank release notes, and via the GenBank
newsgroup.
1.4.2 New type of gap() operator
CON-division records utilize a CONTIG line with a join() statement,
which specifies how sequences can be combined to form a much larger
object. For example:
LOCUS AE016959 23508449 bp DNA linear CON 12-JUN-2003
DEFINITION Oryza sativa (japonica cultivar-group) chromosome 10, complete
A gap operator is legal in these join statements. 'gap()' indicates a gap
of unknown length. 'gap(N)', where 'N' is a positive integer, indicates a gap
with a physically-estimated length of 'N' bases.
In some sequencing projects, a convention is agreed upon by which gaps
of unknown length are all represented by a uniform value, such as 100.
To capture this usage, a new type of gap operator will be legal as of
October 2004 : 'gap(unkN)', where 'N' is a positive integer. For a gap of
length 100, utilized by convention rather than reflective of the gap's
actual size, the operator would be 'gap(unk100)'.
This new gap operator will make clear the distinction between a
gap with a physically-estimated length, and a gap with a length that
has no actual physical basis. Further details about this new operator
will be made available via these release notes and the GenBank newsgroup.
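
Parsers of CONTIG join() statements will need to distinguish the three gap forms described above. The following sketch shows one way to do so; the function name and the category labels are our own, and the pattern is based only on the informal description in this section, since the formal specification of 'gap(unkN)' has not yet been published.

```python
import re

# Matches 'gap()', 'gap(N)' and 'gap(unkN)', where N is a run of digits.
GAP_PATTERN = re.compile(r"^gap\((unk)?(\d*)\)$")


def classify_gap(token: str):
    """Return (kind, length) for a gap operator.

    kind is 'unknown' for gap(), 'estimated' for gap(N), and
    'conventional' for gap(unkN). These labels are illustrative only.
    """
    match = GAP_PATTERN.match(token.strip())
    if match is None:
        raise ValueError(f"not a gap operator: {token!r}")
    unk, digits = match.groups()
    if not digits:
        if unk:
            raise ValueError("gap(unk) requires a length")
        return ("unknown", None)
    length = int(digits)
    if length <= 0:
        raise ValueError("gap length must be a positive integer")
    return ("conventional" if unk else "estimated", length)


for token in ("gap()", "gap(100)", "gap(unk100)"):
    print(token, classify_gap(token))
```

Keeping the conventional and estimated cases separate lets downstream code avoid treating a placeholder length such as 100 as a physical measurement.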
1.4.3 New /compare qualifier
Four different features exist which can be used to annotate regions
of sequence that are either uncertain or that differ in comparison
to some other sequence:
A /citation qualifier is used to refer to a publication that details
the nature of the uncertain or differing bases. However, a publication
may not always be available (unpublished references), and simply
referring to a publication is quite indirect.
The new /compare qualifier will provide a method for directly
referencing a base range on a record that exhibits a sequence
A formal description of /compare will be made available via upcoming
GenBank release notes, and via the GenBank newsgroup.
- GenBank newsgroup see: http://www.bio.net/hypermail/genbankb/
- GENBANKB e-mail: messages sent to genbankb at net.bio.net
- subscribe: e-mail biosci-server at net.bio.net with: subscribe genbankb
- unsub: e-mail biosci-server at net.bio.net with: unsubscribe genbankb
- GenBank on the WWW, see: http://www.ncbi.nlm.nih.gov/Genbank/
- problems with GENBANKB? E-mail moderator: francis at bioinformatics.ubc.ca