GenBank Release 137.0 Now Available
cavanaug at ncbi.nlm.nih.gov
Fri Aug 22 21:14:33 EST 2003
Greetings GenBank Users,
GenBank Release 137.0 is now available via ftp from the National Center
for Biotechnology Information (NCBI):
Ftp Site           Directory   Contents
----------------   ---------   ---------------------------------------
ftp.ncbi.nih.gov   genbank     GenBank Release 137.0 flatfiles
                   ncbi-asn1   ASN.1 data used to create Release 137.0
Uncompressed, the Release 137.0 flatfiles require approximately 112 GB
(sequence files only) or 126 GB (including the 'short directory' and
'index' files). The ASN.1 version requires approximately 98 GB. From
Release   Date       Base Pairs    Entries
136       Jun 2003   32528249295   25592865
137       Aug 2003   33865022251   27213748
Close-of-data was 08/18/2003. Four days were required to prepare this
release. In the nine-week period between the close dates for GenBank
releases 136.0 and 137.0, GenBank grew by 1,336,772,956 basepairs and by
1,620,883 sequence records. During that same period, 310,938 records
were updated. Combined, this yields an average of about 30,660
new/updated records per day.
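The growth figures can be re-derived from the release totals. A minimal Python sketch, assuming the nine-week period is counted as exactly 63 days (the variable names are illustrative only):

```python
# Totals as reported for Releases 136.0 and 137.0.
REL_136 = {"base_pairs": 32_528_249_295, "entries": 25_592_865}
REL_137 = {"base_pairs": 33_865_022_251, "entries": 27_213_748}
UPDATED_RECORDS = 310_938   # records updated between the two close dates
DAYS = 9 * 7                # assumption: nine weeks counted as 63 days

new_base_pairs = REL_137["base_pairs"] - REL_136["base_pairs"]
new_records = REL_137["entries"] - REL_136["entries"]
per_day = (new_records + UPDATED_RECORDS) / DAYS

print(new_base_pairs)   # 1336772956
print(new_records)      # 1620883
print(round(per_day))   # 30664
```

The daily average of roughly 30,664 agrees with the "about 30,660" quoted above.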
We would like to remind our users that GenBank mirrors are available
at ftp://genbank.sdsc.edu/pub and ftp://bio-mirror.net/biomirror/genbank.
Those who experience slow FTP transfers of large files might realize an
improvement in transfer rates from these alternate sites when traffic at
the NCBI is heavy.
For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 137.0 and Upcoming Changes) have been appended
*NOTE* Section 1.4.1 discusses a very important change: the removal
of sequence length limits for all classes of GenBank sequence records,
as of June 2004. We strongly encourage all users to review this
Release 137.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.
If you encounter problems while ftp'ing or uncompressing Release
137.0, please send email outlining your difficulties to
info at ncbi.nlm.nih.gov .
Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev, Michael Kimelman
1.3 Important Changes in Release 137.0
1.3.1 Organizational changes
The total number of sequence data files increased by 24 with this
release:
- the BCT division is now comprised of 8 files (+1)
- the EST division is now comprised of 270 files (+11)
- the GSS division is now comprised of 84 files (+7)
- the PAT division is now comprised of 10 files (+2)
- the PLN division is now comprised of 9 files (+1)
- the PRI division is now comprised of 26 files (+1)
- the ROD division is now comprised of 9 files (+1)
1.3.2 GSS File Header Problem
GSS sequences at GenBank are maintained in one of two different
systems, depending on their origin. One recent change to release
processing involves the parallelization of the dumps from those
systems. Because the second dump (for example) has no prior knowledge
of exactly how many GSS files will be dumped from the first, it
doesn't know how to number its own output files.
There is thus a discrepancy between the filenames and file headers of
eleven GSS flatfiles in Release 137.0. Consider the gbgss74.seq file:
GBGSS1.SEQ Genetic Sequence Data Bank
August 15 2003
NCBI-GenBank Flat File Release 137.0
GSS Sequences (Part 1)
86694 loci, 65544743 bases, from 86694 reported sequences
Here, the filename and part number in the header are "1", though the
file has been renamed as "74" based on the files dumped from the other
system.
We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
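Until the discrepancy is resolved, software that trusts the header can check it against the actual filename. A minimal Python sketch, assuming the first header line begins with the upper-case file name as in the example above (the function name is hypothetical):

```python
import re

def header_matches_filename(filename: str, header_line: str) -> bool:
    """Return True if the file name embedded in a flatfile header
    agrees with the name of the file on disk (case-insensitive)."""
    # The first header line starts with the file name, e.g. "GBGSS1.SEQ".
    match = re.match(r"\s*(\S+\.SEQ)\b", header_line, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no file name found in header line")
    return match.group(1).lower() == filename.lower()

header = "GBGSS1.SEQ          Genetic Sequence Data Bank"
print(header_matches_filename("gbgss74.seq", header))  # False
print(header_matches_filename("gbgss1.seq", header))   # True
```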
1.4 Upcoming Changes
1.4.1 **Sequence Length Limitation To Be Removed In June 2004**
At the May 2003 collaborative meeting among representatives of
GenBank, EMBL, and DDBJ, it was decided that the 350 kilobase limit on
the sequence length of database records will be removed as of June 2004.
Individual, complete sequences are currently expected to be a maximum
of 350 kbp in length. One major reason for the existence of this limit
is as an aid to users of sequence analysis software, some of which
might not be capable of processing megabase-scale sequences.
However, very significant exceptions to the 350 kbp limit have existed
for several years: Phase 1 (unordered, unoriented) and Phase 2 (ordered,
oriented) high-throughput genomic sequences (HTGS) generated by efforts
such as the Human Genome Project; large dispersed eukaryotic genes with
an intron/exon structure that spans more than 350 kbp; and sequences
which result from assemblies of Whole Genome Shotgun (WGS) project data.
Given these exceptions, and the technological advances which have made
large-scale sequencing practical for an increasing number of
researchers, the collaboration has decided that the 350 kbp limit must
be removed.
As of June 2004, the length of database sequences will be limited only
by the natural structures of an organism's genome. For example, a single
record might be used to represent all of human chromosome 1, which is
approximately 245 Mbp in length.
Software developers for some of the larger commercial sequence
analysis packages were recently asked what timeframe would be
appropriate for this change. Answers ranged from "immediately", to
"several months", to "one year". So the one-year timeframe was
selected, to provide ample time to implement changes which
megabase-scale sequences may require.
Some sample records with very large sequences have been made available
so that developers can begin to test their software modifications:
Many changes are expected after the removal of the length limit. For
example, complete bacterial genomes (typically on the order of several
megabases) will be re-assembled into single sequence records. The
submission process for such genomes will become much more streamlined,
since database staff will no longer have to split the genomes into
pieces. BLAST services will be enhanced, so that hits reported within
very large sequences will be presented in a meaningful context.
All such changes will be discussed more fully in future release notes,
the NCBI newsletter, and the GenBank newsgroup.
1.4.2 BASECOUNT line to be dropped
The BASECOUNT line of the GenBank flatfile format provides totals for
the number of A, T, G, C, and 'other' basepairs that are present within
the sequence of a database record. For example:
LOCUS AY244763 5686 bp DNA linear BCT 10-APR-2003
DEFINITION Rhodococcus sp. DS7 cysDNCQ operon, complete sequence.
VERSION AY244763.1 GI:29725657
BASE COUNT 1137 a 1661 c 1821 g 1066 t 1 others
1 cgcggtttgt gacgtctgat tgccggtcat tgacctttgg gtagaacgag ttctattctg
61 tgattgcgtt caatttagaa ccagtccggt acataaatgt accgatgcgg aaatggtgtt
5281 tgtcagctcg gtgtctggng gcgaggctaa gcaccaacgg cttcggtagc agaaccacat
This information is computationally expensive to produce for very
large sequences, and for sequences in the CON division. In the CON
division case, a record might be comprised of 'pointers' to hundreds,
or even thousands, of underlying GenBank records. So to calculate the
BASECOUNT line content, retrievals of sequence data for those many
records must be performed. This can noticeably impact the response
time for flatfile generation within the Entrez application.
Hence, as of the October 2003 GenBank Release (138.0), the BASECOUNT
linetype will no longer be present in GenBank Release and GenBank
Depending on demand, a display option might be implemented in Entrez
which allows users to choose to have BASECOUNT shown.
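Software that depends on these totals can recompute them from the sequence lines once the line is dropped. A minimal Python sketch, assuming each sequence line consists of a position number followed by blocks of bases, as in the record above (the function name and the sample input are made up for illustration):

```python
from collections import Counter

def base_count(sequence_lines):
    """Tally a, c, g, t and 'others' from flatfile sequence lines."""
    counts = Counter()
    for line in sequence_lines:
        # Skip the leading position number; the rest are blocks of bases.
        for block in line.split()[1:]:
            counts.update(block.lower())
    totals = {base: counts.get(base, 0) for base in "acgt"}
    totals["others"] = sum(n for base, n in counts.items() if base not in "acgt")
    return totals

# Hypothetical two-block line, not taken from a real record.
print(base_count(["        1 acgtacgtnn aacc"]))
# {'a': 4, 'c': 4, 'g': 2, 't': 2, 'others': 2}
```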
1.4.3 New oriT feature
As of Release 138.0 in October 2003, a new feature key (oriT) will be
legal for the feature table. Preliminary documentation for this new
feature is available:
Feature Key oriT
Definition origin of transfer; region of a plasmid where
transfer is initiated during the process of
conjugation or mobilisation.
Mandatory qualifiers: None
Optional Qualifiers: /bound_moiety="text"
/locus_tag="text" (single token)
Molecule Scope: DNA
Comments: rep_origin should be used for origins
of replication; /direction has
legal values RIGHT, LEFT and BOTH,
however only RIGHT and LEFT are valid
when used in conjunction with the oriT
feature.
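The /direction restriction described in the comment can be expressed as a small check. A minimal Python sketch, assuming qualifier values are compared case-insensitively (the function name is hypothetical, and this is not NCBI's validation code):

```python
# Legal /direction values as listed in the preliminary documentation.
LEGAL_DIRECTIONS = {"RIGHT", "LEFT", "BOTH"}

def direction_is_valid(feature_key: str, direction: str) -> bool:
    """Return True if /direction is acceptable for the given feature key."""
    value = direction.upper()  # assumption: case is not significant
    if value not in LEGAL_DIRECTIONS:
        return False
    # BOTH is legal in general but not in conjunction with oriT.
    if feature_key == "oriT" and value == "BOTH":
        return False
    return True

print(direction_is_valid("oriT", "LEFT"))        # True
print(direction_is_valid("oriT", "BOTH"))        # False
print(direction_is_valid("rep_origin", "BOTH"))  # True
```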
1.4.4 New /ecotype qualifier
As of the October 2003 GenBank Release (138.0), a new source feature
qualifier called /ecotype will begin to be used. The preliminary
definition for /ecotype is:
Definition A distinct population of organisms of a
widespread species that has adapted
genetically to its own local habitat.
Nevertheless, they can still reproduce
with members of other ecotypes of the
Value format "text"
Comment 'Ecotype' is often applied to standard
genetic stocks of Arabidopsis thaliana,
but it can be applied to any organism,
especially sessile organisms like plants.
1.4.5 Change to value format of /rpt_unit
As of Release 138.0 in October 2003, the value-format of the /rpt_unit
qualifier will be changed to allow 'text'. The currently documented
Value format <feature_label> or <base_range>
Comment used to indicate feature which defines (or base range
repeat unit of which a repeat region is made
However, a very common value for /rpt_unit is a literal sequence
string that represents the repeating unit(s). For example:
So the format of this qualifier will be changed to:
Value format "text" or <feature_label> or <base_range>
Existing feature label and base range values will eventually be
presented as text values.
1.4.6 New operon feature
Starting with the October 2003 release (138.0), a new feature will be
legal for the feature table:
Feature Key: operon
Definition: region containing polycistronic transcripts
and regulatory sequences containing genes
that encode enzymes that are in the same
Optional qualifiers: /allele="text"
/locus_tag="text" (single token)
In bacteria, many genes encoding specific biosynthetic pathways are
transcribed in polycistronic operons. It has been challenging to reflect
this biology within the GenBank flatfile format. GenBank has been using
a gene feature that spans the entire regulatory region and all of the
coding regions, and then gene features corresponding to the individual
genes spanning the coding genes.
The new operon feature will simplify the annotation of cases like
these. Examples of the new operon feature will be provided in future
1.4.7 [er] prefix for JOURNAL line
As of Release 138.0 in October 2003, a new prefix will be legal for
the JOURNAL line: [er].
This prefix is an abbreviation for Electronic Resource, which is a
term that describes journal articles that are available on-line.
In 1999, an interim 'Online Publication' REFERENCE format was adopted
for use at GenBank in order to cite articles appearing only
REFERENCE 1 (bases 1 to 2858)
AUTHORS Smith, J.
TITLE Cloning and expression of a phospholipase gene
JOURNAL Online Publication
REMARK Online-Journal-name; Article Identifier; URL
In subsequent years, no standards for citing on-line journal articles
have emerged from library organizations.
One such library organization (National Library of Medicine, NIH) is
now assigning identifiers (Medline UIs and PubMed Ids) to articles
published on-line, and it is presenting these articles in a manner
that is identical to print-journal articles. For example:
AUTHORS Haas,B.J., Volfovsky,N., Town,C.D., Troukhan,M., Alexandrov,N.,
Feldmann,K.A., Flavell,R.B., White,O. and Salzberg,S.L.
TITLE Full-length messenger RNA sequences greatly improve genome
JOURNAL Genome Biol. 3 (6), RESEARCH0029 (2002)
Although these citations may contain journal abbreviations, volume
numbers, issue/part/supplement numbers, pages, and year (just like a
print-journal citation), there is no guarantee that the contents of
these fields will be comparable to those of print-journal citations.
In the case above, although the page number is a bit unusual
("RESEARCH0029"), software that processes the JOURNAL line would
probably still be able to parse its contents. But there is also a
possibility that these fields could contain unusual characters
(embedded spaces, commas, parentheses), and possibly even URLs. So the
use of [er] :
JOURNAL [er] Genome Biol. 3 (6), RESEARCH0029 (2002)
will act as a warning (primarily to software) that the contents of the
JOURNAL line might not be as parsable as a print-journal JOURNAL line.
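Software that parses JOURNAL lines can test for the prefix before applying its usual print-journal parsing. A minimal Python sketch, assuming the function receives only the text that follows the JOURNAL keyword (the function name is hypothetical):

```python
ER_PREFIX = "[er]"

def split_journal(journal_text: str):
    """Return (is_electronic, citation) for the text of a JOURNAL line."""
    text = journal_text.strip()
    if text.startswith(ER_PREFIX):
        # Electronic Resource: the citation may not parse like a print journal.
        return True, text[len(ER_PREFIX):].strip()
    return False, text

print(split_journal("[er] Genome Biol. 3 (6), RESEARCH0029 (2002)"))
# (True, 'Genome Biol. 3 (6), RESEARCH0029 (2002)')
print(split_journal("Genome Biol. 3 (6), RESEARCH0029 (2002)"))
# (False, 'Genome Biol. 3 (6), RESEARCH0029 (2002)')
```

A caller could fall back to treating the citation as free text whenever the first element of the result is True.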
1.4.8 Accession format of WGS records
Whole Genome Shotgun (WGS) sequences utilize an accession number
format which is different from those used for non-WGS GenBank
sequences. This format is referred to as 4 + 2 + 6, and is comprised
of:
- a 4-letter WGS project code
- a 2-digit assembly-version number
- a 6 (and sometimes 7) digit sequence number
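The three components listed above can be recognized with a regular expression. A minimal Python sketch, assuming the project code is written in upper-case letters; the accession in the example is made up for illustration, and the function name is hypothetical:

```python
import re

# 4-letter project code + 2-digit assembly version + 6 or 7 digit sequence number.
WGS_ACCESSION = re.compile(r"^([A-Z]{4})(\d{2})(\d{6,7})$")

def parse_wgs_accession(accession: str):
    """Split a 4 + 2 + 6 accession into its parts, or return None."""
    match = WGS_ACCESSION.match(accession)
    if match is None:
        return None
    project, version, number = match.groups()
    return {
        "project": project,
        "assembly_version": int(version),
        "sequence_number": int(number),
    }

print(parse_wgs_accession("ABCD01000001"))
# {'project': 'ABCD', 'assembly_version': 1, 'sequence_number': 1}
print(parse_wgs_accession("AY244763"))
# None
```

A check of this kind lets software tell WGS accessions apart from non-WGS accessions such as AY244763.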
Because of their unique nature, WGS sequences are kept separate from
other GenBank products:
For example, sequences much larger than the current 350 kbp limit can
be generated during the WGS assembly phase. In addition, there is no
tracking of nucleotide sequences from one assembly to the next. So the
accessions of one:
are not necessarily related in any way to those of the next assembly:
When a WGS project is completed, it is possible that the submitters
may choose to submit a single finished WGS sequence with a 4 + 2 + 6
accession, at which point it would appear in the non-WGS portion of
GenBank.
Alternately, the submitters might choose to submit the completed genome
via a non-WGS method, in which case a de-novo non-WGS accession would
be assigned. That record would then have one or more 4 + 2 + 6 WGS
accessions as secondary accessions.
Both scenarios are likely to occur, especially after the 350 kbp
sequence length restriction is lifted. So we felt it was important to
alert users that WGS accessions will eventually be encountered in the
non-WGS portion of GenBank, as primary or secondary accession numbers.
- GenBank newsgroup see: http://www.bio.net/hypermail/genbankb/
- GENBANKB e-mail: messages sent to genbankb at net.bio.net
- subscribe: e-mail biosci-server at net.bio.net with: subscribe genbankb
- unsub: e-mail biosci-server at net.bio.net with: unsubscribe genbankb
- GenBank on the WWW, see: http://www.ncbi.nlm.nih.gov/Genbank/
- problems with GENBANKB? E-mail moderator: francis at cmmt.ubc.ca