GenBank Release 144.0 Now Available
cavanaug at ncbi.nlm.nih.gov
Wed Oct 20 02:29:29 EST 2004
Greetings GenBank Users,
GenBank Release 144.0 is now available via ftp from the National
Center for Biotechnology Information (NCBI):
Ftp Site          Directory   Contents
----------------  ---------   ---------------------------------------
ftp.ncbi.nih.gov  genbank     GenBank Release 144.0 flatfiles
                  ncbi-asn1   ASN.1 data used to create Release 144.0
Close-of-data was 10/13/2004. Five business days were required to build
Release 144.0. Uncompressed, the Release 144.0 flatfiles require approximately
147 GB (sequence files only) or 166 GB (including the 'short directory' and
'index' files). The ASN.1 version requires approximately 128 GB. From
the release notes:
Release  Date      Base Pairs    Entries
143      Aug 2004  41808045653   37343937
144      Oct 2004  43194602655   38941263
In the eight-week period between the close dates for GenBank Releases 143.0
and 144.0, the non-WGS portion of GenBank grew by 1,386,557,002 basepairs
and by 1,597,326 sequence records. During that same period, 349,631 records
were updated. Combined, this yields an average of about 31,900 new and/or
updated records per day.
Between releases 143.0 and 144.0, the WGS component of GenBank grew by
2,742,978,532 basepairs and by 857,503 sequence records.
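The non-WGS growth figures quoted above follow from the two rows of the release table. A minimal Python sketch of that arithmetic is given here; the 61-day interval is an assumption inferred from the stated daily average, not a figure given in the release notes.

```python
# Totals copied from the release-notes table (non-WGS portion of GenBank).
REL_143 = {"base_pairs": 41_808_045_653, "entries": 37_343_937}
REL_144 = {"base_pairs": 43_194_602_655, "entries": 38_941_263}
UPDATED_RECORDS = 349_631  # records updated between the two close dates


def growth(old, new):
    """Return (base-pair increase, entry increase) between two releases."""
    return (new["base_pairs"] - old["base_pairs"],
            new["entries"] - old["entries"])


def daily_average(new_records, updated_records, days):
    """Average number of new and/or updated records per day."""
    return (new_records + updated_records) / days


bp_delta, entry_delta = growth(REL_143, REL_144)
# ASSUMPTION: 61 days between close dates, chosen because it reproduces
# the "about 31,900" per-day figure quoted in the announcement.
average = daily_average(entry_delta, UPDATED_RECORDS, days=61)
print(bp_delta, entry_delta, round(average))
```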
* * * Important * * *
The GenBank mirror located at ftp://genbank.sdsc.edu/pub is out of service
for several weeks. Users should not use the SDSC mirror for GenBank 144.0.
The alternate mirror at ftp://bio-mirror.net/biomirror/genbank remains available.
As a general guideline, we suggest first transferring the GenBank release
notes (gbrel.txt) whenever a release is being obtained. Check to make sure
that the date and release number in the header of the release notes are current
(October 15 2004, 144.0). If they are not, interrupt the remaining transfers and
then request assistance from the NCBI Service Desk.
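The check described above could be automated along the following lines. This is only a sketch: the function name is hypothetical, and the sample header layout is an assumption modeled on the flatfile header shown later in this message rather than on the actual contents of gbrel.txt.

```python
import re


def release_notes_are_current(header_text, expected_release, expected_date):
    """Return True if the header names the expected release and date."""
    release = re.search(r"Release\s+(\d+\.\d+)", header_text)
    # Collapse runs of whitespace so the date matches however it is padded.
    normalized = " ".join(header_text.split())
    return (release is not None
            and release.group(1) == expected_release
            and expected_date in normalized)


# ASSUMPTION: illustrative header text, not copied from gbrel.txt.
sample_header = """
          October 15 2004
    NCBI-GenBank Flat File Release 144.0
"""
print(release_notes_are_current(sample_header, "144.0", "October 15 2004"))
```

If the function returns False for the first few lines of a freshly transferred gbrel.txt, the remaining transfers would be the ones to interrupt.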
For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 144.0 and Upcoming Changes) have been appended below.
Release 144.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.
If you encounter problems while ftp'ing or uncompressing Release
144.0, please send email outlining your difficulties to
info at ncbi.nlm.nih.gov .
Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman
1.3 Important Changes in Release 144.0
1.3.1 Organizational changes
The total number of sequence data files increased by 24 with this release:
- the EST division is now comprised of 349 files (+14)
- the GSS division is now comprised of 120 files (+4)
- the HTG division is now comprised of 62 files (+1)
- the INV division is now comprised of 7 files (+1)
- the PAT division is now comprised of 16 files (+1)
- the PLN division is now comprised of 13 files (+1)
- the ROD division is now comprised of 14 files (+1)
In addition, the MAM division has been newly split into two files,
gbmam1.seq and gbmam2.seq (+1).
1.3.2 New qualifier : /old_locus_tag
The /locus_tag qualifier was introduced in April 2003 to provide
a method for systematically identifying genes, coding regions and
other features which typically result from computational analysis.
This qualifier is often used instead of /gene .
Sometimes the /locus_tag identifier series supplied by a submitter
of sequence data undergoes a change. Because the original /locus_tag
identifiers might be referenced in journal articles, or in databases,
a means of presenting the original identifiers is needed.
So a new qualifier, /old_locus_tag , has been introduced as of this
October 2004 release :
Definition      feature tag assigned for tracking purposes
Value Format    "text" (single token)
Comment         /old_locus_tag can be used with any feature where /gene is
                valid and where a /locus_tag qualifier is present.
                Identical /old_locus_tag values may be used within an
                entry/record, but only if the identical /old_locus_tag
                values are associated with the same gene; in all other
                circumstances the /old_locus_tag value must be unique
                within that entry/record.
                Multiple /old_locus_tag qualifiers with distinct values are
                allowed within a single feature; /old_locus_tag and
                /locus_tag values must not be identical within a single
                feature.
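The /old_locus_tag rules above can be expressed as a small consistency check. The sketch below is one possible reading of those rules, not NCBI's validation code; the dictionary layout for features and the tag values in the example are hypothetical.

```python
def check_old_locus_tags(features):
    """Return a list of rule violations for the features of one record.

    ASSUMPTION: each feature is a dict with optional keys 'gene' (str),
    'locus_tag' (str) and 'old_locus_tag' (list of str).
    """
    problems = []
    gene_for_old_tag = {}  # first gene seen for each /old_locus_tag value
    for index, feature in enumerate(features):
        old_tags = feature.get("old_locus_tag", [])
        if old_tags and not feature.get("locus_tag"):
            problems.append(f"feature {index}: /old_locus_tag without /locus_tag")
        for tag in old_tags:
            if tag == feature.get("locus_tag"):
                problems.append(f"feature {index}: {tag} equals /locus_tag")
            # Identical values are allowed only when tied to the same gene.
            first_gene = gene_for_old_tag.setdefault(tag, feature.get("gene"))
            if first_gene != feature.get("gene"):
                problems.append(f"feature {index}: {tag} reused for another gene")
    return problems


# Hypothetical record: a gene feature and its CDS share one old identifier.
record = [
    {"gene": "abcA", "locus_tag": "NEW_0001", "old_locus_tag": ["OLD_0001"]},
    {"gene": "abcA", "locus_tag": "NEW_0001", "old_locus_tag": ["OLD_0001"]},
]
print(check_old_locus_tags(record))
```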
1.3.3 New type of gap() operator
CON-division records utilize a CONTIG line with a join() statement,
which specifies how sequences can be combined to form a much larger
object. For example:
LOCUS AE016959 23508449 bp DNA linear CON 12-JUN-2003
DEFINITION Oryza sativa (japonica cultivar-group) chromosome 10, complete
A gap operator is legal in these join statements:
gap() : indicates a gap of unknown length
gap(N) : where 'N' is a positive integer, indicates a gap with a
physically-estimated length of 'N' bases.
In some sequencing projects, a convention is agreed upon by which gaps
of unknown length are all represented by a uniform value, such as 100.
To reflect this convention, a new type of gap operator is legal as of
October 2004 : 'gap(unkN)', where 'N' is a positive integer. For a gap of
length 100, utilized by convention rather than reflective of the gap's
actual size, the operator would be:
    gap(unk100)
This new gap operator will make clear the distinction between a
gap with a physically-estimated length, and a gap with a length that
has no actual physical basis.
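A parser can tell the three gap forms apart with a single regular expression. The following is a minimal sketch; the function name and the labels "unknown", "estimated" and "conventional" are illustrative choices, not terms defined by the release notes.

```python
import re

# Matches gap(), gap(N) and gap(unkN), where N is a positive integer.
GAP_PATTERN = re.compile(r"^gap\((?:(unk)?([1-9]\d*))?\)$")


def parse_gap(operator):
    """Return (kind, length) for a gap operator, or raise ValueError."""
    match = GAP_PATTERN.match(operator.strip())
    if match is None:
        raise ValueError(f"not a gap operator: {operator!r}")
    unk, digits = match.groups()
    if digits is None:
        return ("unknown", None)             # gap()
    if unk:
        return ("conventional", int(digits))  # gap(unkN)
    return ("estimated", int(digits))         # gap(N)


for text in ("gap()", "gap(100)", "gap(unk100)"):
    print(text, parse_gap(text))
```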
1.3.3 New /compare qualifier
Five different features exist which can be used to annotate regions
of sequence that are either uncertain or that differ in comparison
to some other sequence: misc_difference, conflict, unsure, old_sequence,
and variation.
A /citation qualifier is used to refer to a publication that details
the nature of the uncertain or differing bases. However, a publication
may not always be available (unpublished references), and simply
referring to a publication is quite indirect.
The new /compare qualifier provides a method for directly
referencing a particular sequence that exhibits a sequence difference.
This new qualifier is legal as of this October 2004 GenBank Release.
The complete description of /compare is as follows:
Definition      Reference details of an existing public INSD entry
                to which a comparison is made
Value format    [accession-number.sequence-version]
Comment         This qualifier may be used on the following features:
                misc_difference, conflict, unsure, old_sequence
                and variation. The features "old_sequence" and "conflict"
                must have either a /citation or a /compare qualifier.
                Multiple /compare qualifiers with different contents are
                allowed within a single feature.
                This qualifier is not intended for large-scale annotation
                of variations, such as SNPs.
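The /compare rules above lend themselves to a simple per-feature check. This is a sketch under stated assumptions: the regular expression for accession.version is a rough approximation rather than the official INSD accession grammar, the qualifier dictionary layout is hypothetical, and the value "U00096.1" is used only as an illustration.

```python
import re

# ASSUMPTION: letters, then digits, a dot, and a positive version number.
COMPARE_VALUE = re.compile(r"^[A-Z]{1,6}\d{5,}\.[1-9]\d*$")
COMPARE_FEATURES = {"misc_difference", "conflict", "unsure",
                    "old_sequence", "variation"}
NEEDS_REFERENCE = {"old_sequence", "conflict"}


def check_feature(key, qualifiers):
    """Return problems for one feature.

    'qualifiers' maps a qualifier name (without the slash) to a list of
    values, e.g. {"compare": ["U00096.1"]}.
    """
    problems = []
    compares = qualifiers.get("compare", [])
    if compares and key not in COMPARE_FEATURES:
        problems.append(f"/compare is not legal on {key}")
    for value in compares:
        if not COMPARE_VALUE.match(value):
            problems.append(f"/compare value {value!r} is not accession.version")
    if key in NEEDS_REFERENCE and not (compares or qualifiers.get("citation")):
        problems.append(f"{key} needs a /citation or a /compare qualifier")
    return problems


print(check_feature("conflict", {"compare": ["U00096.1"]}))
print(check_feature("old_sequence", {}))
```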
1.3.4 GSS File Header Problem
GSS sequences at GenBank are maintained in one of two different systems,
depending on their origin. One recent change to release processing involves
the parallelization of the dumps from those systems. Because the second dump
(for example) has no prior knowledge of exactly how many GSS files will be
dumped from the first, it doesn't know how to number its own output files.
There is thus a discrepancy between the filenames and file headers for
eighteen GSS flatfiles in Release 144.0. Consider the gbgss100.seq file:
GBGSS1.SEQ Genetic Sequence Data Bank
October 15 2004
NCBI-GenBank Flat File Release 144.0
GSS Sequences (Part 1)
88260 loci, 65614942 bases, from 88260 reported sequences
Here, the filename and part number in the header are both "1", though the file
itself has been renamed as "100" based on the files dumped from the other system.
We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
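Software that cross-checks file headers against filenames may want to detect the affected GSS files. A minimal sketch follows, assuming the first header line begins with the uppercase file name as in the gbgss100.seq example; the function names are hypothetical.

```python
import re


def part_number(name):
    """Extract the numeric part from a name such as 'gbgss100.seq'."""
    match = re.fullmatch(r"gbgss(\d+)\.seq", name.strip().lower())
    if match is None:
        raise ValueError(f"not a GSS flatfile name: {name!r}")
    return int(match.group(1))


def header_matches_filename(filename, header_first_line):
    """Compare the part number in the filename with the one in the header."""
    # ASSUMPTION: the first whitespace-delimited token of the header line
    # is the file name, e.g. 'GBGSS1.SEQ'.
    header_name = header_first_line.split()[0]
    return part_number(filename) == part_number(header_name)


print(header_matches_filename(
    "gbgss100.seq", "GBGSS1.SEQ          Genetic Sequence Data Bank"))
```

For the gbgss100.seq example above, the check reports a mismatch, which is expected for the eighteen affected files in this release.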
1.4 Upcoming Changes
1.4.1 New gap feature
A new feature key for sequence gaps will become legal as of the
December 2004 GenBank release:
Feature key           gap
Definition            gap in the sequence
Mandatory qualifiers  /estimated_length=unknown or <integer>
Optional qualifiers   /map="text"
Comment               the location span of the gap feature for an unknown
                      gap is 100 bp, with the 100 bp indicated as 100 "n"s
                      in the sequence. Where estimated length is indicated
                      by an integer, this is indicated by the same number
                      of "n"s in the sequence.
                      No upper or lower limit is set on the size of the gap.
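The relationship between /estimated_length and the run of "n"s can be checked mechanically. The sketch below is one reading of the comment above, written before the feature becomes legal; the function names and the 1-based inclusive coordinate convention are assumptions.

```python
UNKNOWN_GAP_SPAN = 100  # span used for gaps of unknown length


def expected_span(estimated_length):
    """Number of 'n's implied by an /estimated_length value."""
    if estimated_length == "unknown":
        return UNKNOWN_GAP_SPAN
    return int(estimated_length)


def gap_is_consistent(sequence, start, end, estimated_length):
    """Check that sequence[start..end] is all 'n' and has the implied length.

    ASSUMPTION: start and end are 1-based and inclusive.
    """
    span = sequence[start - 1:end]
    return (len(span) == expected_span(estimated_length)
            and set(span.lower()) == {"n"})


# Hypothetical sequence: four bases, an unknown-length gap, four bases.
sequence = "acgt" + "n" * 100 + "acgt"
print(gap_is_consistent(sequence, 5, 104, "unknown"))
```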
1.4.2 Continuous ranges of secondary accessions
With the removal of sequence length limits, some genomes (typically
bacterial) that had been split into many pieces are gradually being
replaced by a single sequence record. U00096 is a good example.
When this happens, the accessions of the former small pieces become
secondary accessions for the single large sequence record. When each
secondary is separately listed, the ACCESSION line becomes excessively
long.
As of GenBank Release 146.0 in February 2005, it will be legal to
represent continuous ranges of secondary accessions by a start accession,
a dash character, and an end accession. In the case of U00096, the
ACCESSION line would thus look like:
ACCESSION U00096 AE000111-AE000510
Further details about the conventions for secondary accession ranges
will be provided via these release notes and the GenBank newsgroup.
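Parsers that need the individual secondary accessions could expand a range as sketched below. Because the detailed conventions have not yet been published, this assumes that both ends of a range share the same letter prefix and digit width; the function names are hypothetical.

```python
import re


def expand_accession_range(token):
    """Expand 'AE000111-AE000510' into a list; return [token] if not a range."""
    match = re.fullmatch(r"([A-Z]+)(\d+)-([A-Z]+)(\d+)", token)
    if match is None:
        return [token]
    prefix, start, end_prefix, end = match.groups()
    # ASSUMPTION: a range never crosses prefixes or digit widths.
    if prefix != end_prefix or len(start) != len(end) or int(start) > int(end):
        raise ValueError(f"unsupported accession range: {token!r}")
    width = len(start)
    return [f"{prefix}{number:0{width}d}"
            for number in range(int(start), int(end) + 1)]


def secondary_accessions(accession_line):
    """Return all secondary accessions from an ACCESSION line."""
    tokens = accession_line.split()[2:]  # skip 'ACCESSION' and the primary
    expanded = []
    for token in tokens:
        expanded.extend(expand_accession_range(token))
    return expanded


secondaries = secondary_accessions("ACCESSION   U00096 AE000111-AE000510")
print(len(secondaries), secondaries[0], secondaries[-1])
```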
- GenBank newsgroup see: http://www.bio.net/hypermail/genbankb/
- GENBANKB e-mail: messages sent to genbankb at net.bio.net
- subscribe: e-mail biosci-server at net.bio.net with: subscribe genbankb
- unsub: e-mail biosci-server at net.bio.net with: unsubscribe genbankb
- GenBank on the WWW, see: http://www.ncbi.nlm.nih.gov/Genbank/
- problems with GENBANKB? E-mail moderator: francis at bioinformatics.ubc.ca