[Genbank-bb] GenBank Release 154.0 Now Available
cavanaug at ncbi.nlm.nih.gov
Fri Jun 16 01:29:48 EST 2006
Greetings GenBank Users,
GenBank Release 154.0 is now available via FTP from the National
Center for Biotechnology Information (NCBI):
Ftp Site           Directory   Contents
----------------   ---------   ---------------------------------------
ftp.ncbi.nih.gov   genbank     GenBank Release 154.0 flatfiles
                   ncbi-asn1   ASN.1 data used to create Release 154.0
Close-of-data for GenBank 154.0 occurred on 06/09/2006. Uncompressed, the
Release 154.0 flatfiles require roughly 222 GB (sequence files only)
or 232 GB (including the 'short directory', 'index' and the *.txt files).
The ASN.1 data require approximately 192 GB.
Statistics for non-WGS sequences:
Release  Date      Base Pairs   Entries
153      Apr 2006  61582143971  56620500
154      Jun 2006  63412609711  58890345
And for WGS sequences:
Release  Date      Base Pairs   Entries
153      Apr 2006  67488612571  13573144
154      Jun 2006  78858635822  17733973
During the 59 days between the close dates for GenBank Releases 153.0
and 154.0, the non-WGS portion of GenBank grew by 1,830,465,740 basepairs
and by 2,269,845 sequence records. During that same period, 2,184,755 records
were updated. An average of about 75,500 non-WGS records were added and/or
updated per day.
Between releases 153.0 and 154.0, the WGS component of GenBank grew by
11,370,023,251 basepairs and by 4,160,829 sequence records.
The combined (WGS and non-WGS) basepair growth of 13,200,488,991 bases
experienced for GenBank 154.0 represents the largest single-release increase
in the history of the database.
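The growth figures quoted above can be re-derived from the two statistics tables. The following is a minimal Python sketch of that arithmetic; the variable and function names are illustrative only, and the inputs are simply the numbers stated in this announcement.

```python
# Figures copied from the statistics tables in this announcement:
# release -> (base pairs, entries)
non_wgs = {153: (61_582_143_971, 56_620_500), 154: (63_412_609_711, 58_890_345)}
wgs = {153: (67_488_612_571, 13_573_144), 154: (78_858_635_822, 17_733_973)}


def growth(table, old=153, new=154):
    """Return (base-pair growth, entry growth) between two releases."""
    return (table[new][0] - table[old][0], table[new][1] - table[old][1])


non_wgs_bp, non_wgs_entries = growth(non_wgs)
wgs_bp, wgs_entries = growth(wgs)
combined_bp = non_wgs_bp + wgs_bp

updated_records = 2_184_755  # non-WGS records updated, as stated above
days = 59                    # days between the two close dates
per_day = (non_wgs_entries + updated_records) / days

print("non-WGS growth:", non_wgs_bp, "bp,", non_wgs_entries, "records")
print("WGS growth:", wgs_bp, "bp,", wgs_entries, "records")
print("combined growth:", combined_bp, "bp")
print("added/updated per day:", round(per_day))
```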
For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 154.0 and Upcoming Changes) have been appended below.
** Important Note #1 **
A new protein residue abbreviation for the 22nd naturally occurring
amino acid, pyrrolysine, will become legal in GenBank protein sequences
as of October 2006 (Release 156.0). Please see Section 1.4.1 for further details.
** Important Note #2 **
After recent problems generating the 'index' files which normally
accompany GenBank Releases, these files are once again being provided,
though without any EST content, and without most GSS content. See Section
1.3.3 for further details. NCBI is considering ceasing support for the
index files, so we strongly encourage affected users to review that section
and provide feedback.
Release 154.0 data, and subsequent updates, are available now via
NCBI's Entrez and BLAST services.
As a general guideline, we suggest first transferring the GenBank release
notes (gbrel.txt) whenever a release is being obtained. Check to make sure
that the date and release number in the header of the release notes are
current (e.g., June 15 2006, 154.0). If they are not, interrupt the
remaining transfers and then request assistance from the NCBI Service Desk.
A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a unix platform with csh/tcsh :
    set files = `ls gb*.*`
    foreach i ($files)
        head -10 $i | grep Release
    end
Or, if the files are compressed, perhaps:
gzcat $i | head -10 | grep Release
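For sites that prefer a script over an interactive shell loop, the same header check might be sketched in Python. This is only an illustration: the function names are hypothetical, and it assumes that the release number appears as "Release NNN.N" within the first ten lines of each file, as in the flatfile header shown in Section 1.3.4.

```python
import gzip
import re
from pathlib import Path


def release_in_header(path, max_lines=10):
    """Return the release number found in the first lines of a file, or None."""
    # Assumption: compressed files carry a .gz suffix.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        for _, line in zip(range(max_lines), handle):
            match = re.search(r"Release\s+(\d+\.\d+)", line)
            if match:
                return match.group(1)
    return None


def stale_files(directory, expected="154.0"):
    """List gb*.* files whose header does not mention the expected release."""
    return sorted(
        p.name
        for p in Path(directory).glob("gb*.*")
        if release_in_header(p) != expected
    )
```

Calling stale_files(".") in the download directory would then list any file whose header does not carry the expected release number.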
If you encounter problems while ftp'ing or uncompressing Release
154.0, please send email outlining your difficulties to:
info at ncbi.nlm.nih.gov
Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman
1.3 Important Changes in Release 154.0
1.3.1 New JOURNAL type for Pre-Grant Patent Publications
Sequences associated with granted patents from the US Patent and
Trademark Office (USPTO) typically have references that look like this:
REFERENCE   1  (bases 1 to 22)
  TITLE     Screening methods for identifying ligands
  JOURNAL   Patent: US 6950757-A 2 27-SEP-2005;
The "Patent:" token indicates that the JOURNAL line pertains to a
patent document, as opposed to a published article in the scientific
literature.
But sequence data can be available well in advance of the point at which
an actual patent has been granted. As of GenBank Release 154 in June 2006,
a patent sequence associated with a "Pre-Grant Publication" is now
indicated via a slight change to the JOURNAL line:
REFERENCE   1  (bases 1 to 190)
  AUTHORS   Xu,M. and Humphreys,R.
  TITLE     Inhibition of li expression in mammalian cells
  JOURNAL   Pre-Grant Patent: US 20060008448A1 1 12-JAN-2006;
The introduction of "Pre-Grant Patent:" at the start of the JOURNAL
line distinguishes sequences associated with these two different
states in USPTO's patenting process.
Note that pre-grant identifiers from the USPTO are alphanumeric, and
lack a document-type suffix ("-A" in the granted-patent example above).
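Software that reads JOURNAL lines may need to tell the two forms apart. Here is a minimal Python sketch, assuming the line layout of the two examples above (prefix, country code, document identifier); the regular expression and the function name are illustrative and are not part of any NCBI tool.

```python
import re

# Assumed layout, based only on the two examples above:
#   JOURNAL   <Patent|Pre-Grant Patent>: <country> <document id> ...
JOURNAL_PATENT = re.compile(
    r"^\s*JOURNAL\s+(Pre-Grant Patent|Patent):\s+(\S+)\s+(\S+)"
)


def classify_journal(line):
    """Return (kind, country, document_id) for a patent JOURNAL line, else None."""
    match = JOURNAL_PATENT.match(line)
    if match is None:
        return None  # not a patent reference
    kind = "pre-grant" if match.group(1) == "Pre-Grant Patent" else "granted"
    return kind, match.group(2), match.group(3)


print(classify_journal("  JOURNAL   Patent: US 6950757-A 2 27-SEP-2005;"))
print(classify_journal("  JOURNAL   Pre-Grant Patent: US 20060008448A1 1 12-JAN-2006;"))
```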
1.3.2 Organizational changes
The total number of sequence data files increased by 32 with this release:
- the BCT division is now comprised of 15 files (+1)
- the CON division is newly split into 3 pieces (+2)
- the EST division is now comprised of 528 files (+16)
- the GSS division is now comprised of 177 files (+3)
- the HTG division is now comprised of 83 files (+2)
- the INV division is now comprised of 9 files (+1)
- the PAT division is now comprised of 24 files (+4)
- the PLN division is now comprised of 18 files (+1)
- the VRL division is now comprised of 6 files (+1)
- the VRT division is now comprised of 11 files (+1)
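As a quick consistency check, the per-division increments listed above can be summed and compared with the stated total of 32. A short Python sketch (the dictionary name is illustrative):

```python
# Per-division increases in file count, copied from the list above.
increments = {
    "BCT": 1, "CON": 2, "EST": 16, "GSS": 3, "HTG": 2,
    "INV": 1, "PAT": 4, "PLN": 1, "VRL": 1, "VRT": 1,
}

total_new_files = sum(increments.values())
print("new sequence data files:", total_new_files)
```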
In addition, the Short-Directory 'index' file has also been split into
three files:
gbsdr1.txt : non-EST and non-GSS short directory entries
gbsdr2.txt : EST short directory entries
gbsdr3.txt : GSS short directory entries
1.3.3 Changes in the content of index files
As described in the GB 153 release notes, the 'index' files which accompany
GenBank releases (see Section 3.3) are considered to be a legacy data product by
NCBI, generated mostly for historical reasons. FTP statistics since January 2005
seem to support this: the index files are transferred only half as frequently as
the files of sequence records. The inherent inefficiencies of the index file
format also lead us to suspect that they have little serious use by the user
community, particularly for EST and GSS records.
The software that generated the index file products received little
attention over the years, and finally reached its limitations in
February 2006 (Release 152.0). The required multi-server queries which
obtained and sorted many millions of rows of terms from several different
databases simply outgrew the capacity of the hardware used for GenBank
Our short-term solution is to cease generating index-file content
for all EST sequence records, and for GSS sequence records that originate
via direct submission to NCBI. GenBank 154.0 thus contains these ten index
files, which lack all EST and most GSS content:
In addition, a version of gbacc.idx which encompasses the entirety of the
release was built manually, but note that the first field contains just an
accession number, rather than Accession.Version, and that the file is unsorted.
These 'solutions' are really just stop-gaps, and we will likely pursue
one of two options within the next year:
a) Cease support of the 'index' file products altogether.
b) Provide new products that present some of the most useful data from
the legacy 'index' files, and cease support for other types of index data.
If you are a user of the 'index' files associated with GenBank files, we
encourage you to make your wishes known, either via the GenBank newsgroup,
or via email to NCBI's Service Desk:
info at ncbi.nlm.nih.gov
Our apologies for any inconvenience that these changes may cause.
1.3.4 GSS File Header Problem
GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped from the first, it does not know how to number its own
files.
There is thus a discrepancy between the filenames and file headers for
thirty-three of the GSS flatfiles in Release 154.0. Consider gbgss145.seq :
    GBGSS1.SEQ          Genetic Sequence Data Bank
                        June 15 2006
                        NCBI-GenBank Flat File Release 154.0
                        GSS Sequences (Part 1)
    86832 loci, 64420446 bases, from 86832 reported sequences
Here, both the filename and the part number in the header are "1", though the file
has been renamed as "145" based on the number of files dumped from the other
system. We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
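Users who want to identify the affected files locally could compare each file's name with the name recorded on the first line of its header. The Python sketch below assumes that the first whitespace-delimited token of the first line is the upper-case filename, as in the gbgss145.seq example above; the function name is hypothetical.

```python
from pathlib import Path


def header_mismatch(path):
    """Return (file_name, header_name) if the two differ, else None.

    Assumption: the first token on the first header line is the file
    name in upper case, e.g. 'GBGSS1.SEQ'.
    """
    with open(path, "rt", encoding="ascii", errors="replace") as handle:
        first_line = handle.readline()
    tokens = first_line.split()
    header_name = tokens[0].lower() if tokens else ""
    file_name = Path(path).name.lower()
    return None if header_name == file_name else (file_name, header_name)
```

For the file shown above, header_mismatch("gbgss145.seq") would return the pair ("gbgss145.seq", "gbgss1.seq").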
1.4 Upcoming Changes
1.4.1 New protein residue abbreviation for Pyrrolysine
Sequence databases use single-letter amino acid abbreviations to
record the primary structure (sequence) of amino acids in a polypeptide.
The table of abbreviations includes only those amino acids that are
encoded in the genetic code and directly inserted by a tRNA during the
process of protein translation. Post-translational modifications are
not represented in the sequence data itself, but may be described by
features annotated on the sequence.
The discovery of the 22nd naturally encoded amino acid, pyrrolysine,
and the recent submission of sequence records that should contain
this residue, require the adoption of a new amino acid abbreviation.
Because several letters are assigned to represent different experimental
ambiguities, the only letter still available for use is O (uppercase
letter o). Scientists working in the field have independently suggested
use of this letter, and it has a reasonable mnemonic, pyrrOlysine.
IUPAC, the body which is responsible for biochemical nomenclature,
has agreed that Pyl/O will be recommended for this amino acid.
The consequences for flatfile users are that O will appear in CDS
/translation qualifiers, and that Pyl (the three-letter abbreviation)
will appear in CDS /transl_except qualifiers and in the /product and
/anticodon qualifiers of tRNA features. These changes will take effect
as of the October 2006 GenBank release.
Sample records in ASN.1, FASTA, GenBank flatfile, and INSDSeq XML
formats will be made available on the NCBI ftp site for the purpose of
testing software prior to the public introduction of 'O' in protein
sequences.
For BLAST and other sequence similarity search tools, we expect to map
pyrrolysine (O) to unknown (X), as is already done with selenocysteine
(U), the 21st naturally encoded amino acid. One reason is that the PAM
and BLOSUM substitution matrices do not accommodate these more recently
discovered amino acids. The other reason is that selenocysteine and
pyrrolysine both appear to be used as active sites in certain enzymes,
and thus do not simply substitute for cysteine or lysine.
Here are a few literature references which provide more information
about pyrrolysine :
G. Srinivasan, C. M. James, J. A. Krzycki. Pyrrolysine encoded by
UAG in Archaea: charging of a UAG-decoding specialized tRNA. Science
B. Hao, W. Gong, T.K. Ferguson, C.M. James, J.A. Krzycki, M.K.
Chan. A new UAG-encoded residue in the structure of a methanogen
methyltransferase. Science 2002, 296:1462-1466.
C. Polycarpo, A. Ambrogelly, A. Berube, S.M. Winbush, J.A.
McCloskey, P. F. Crain, J. L. Wood, D. Soll. An aminoacyl-tRNA
synthetase that specifically activates pyrrolysine. Proc. Natl. Acad.
Sci. (USA) 2004, 101:12450-12454.
C. Fenske, G.J. Palm, W. Hinrichs. How unique is the genetic code?
Angew. Chem. Int. Ed. 2003, 42:606-610.
1.4.2 Protein residue J for leucine/isoleucine ambiguities
The residue abbreviation J is reserved for mass spectrometry experiments that
cannot distinguish leucine from isoleucine. Although this abbreviation has
been part of the IUPAC recommendations for some time, it has not previously
appeared in protein sequences in the GenBank database.
As of October 2006, abbreviation J will be legal in CDS /translation
qualifiers, and Xle (the three-letter abbreviation) will be allowed in CDS
/transl_except qualifiers and in the /product and /anticodon qualifiers of
tRNA features.
J will also be mapped to unknown (X) for the purpose of BLAST and other
sequence similarity search tools.
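Local pipelines that feed protein sequences to similarity search tools may wish to apply the same mapping described in Sections 1.4.1 and 1.4.2. The Python sketch below is one way to do so; it is an illustration only, not the implementation used by BLAST, and the function name is hypothetical.

```python
# Residues mapped to unknown (X): selenocysteine (U), pyrrolysine (O),
# and the leucine/isoleucine ambiguity code (J).
TO_UNKNOWN = str.maketrans("UOJ", "XXX")


def mask_for_search(protein):
    """Upper-case a protein sequence and replace U, O and J with X."""
    return protein.upper().translate(TO_UNKNOWN)


print(mask_for_search("MKUOLJ"))
```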
1.4.3 /PCR_primers and modified bases
PCR primers are sometimes constructed which utilize modified bases,
such as those listed in the table of modified bases included in the
Feature Table document:
In October 2006, it will be legal to use modified-base abbreviations
for the /PCR_primers qualifier. For example:
/PCR_primers="fwd_seq: gcagtt<i>caag<gal q>tggagtgaa, rev_seq:
Here, modified bases inosine and beta,D-galactosylqueosine are included
in the forward sequence of the primer pair, and enclosed between angle
brackets ( <...> ) .
Each pair of angle brackets will include only a single modified base
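Parsers that currently treat primer sequences as plain strings of bases will need to recognize the bracketed abbreviations. A minimal Python sketch of one possible tokenizer follows; the set of plain-base letters (the IUPAC nucleotide codes) and the function name are assumptions made for illustration.

```python
import re

# Either one <...> modified-base abbreviation or one plain base letter.
# Assumption: plain bases are limited to IUPAC nucleotide codes.
TOKEN = re.compile(r"<([^<>]+)>|([acgturykmswbdhvn])", re.IGNORECASE)


def tokenize_primer(seq):
    """Split a primer sequence into ('base', x) and ('modified', abbr) tokens."""
    tokens = []
    pos = 0
    while pos < len(seq):
        match = TOKEN.match(seq, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}: {seq[pos]!r}")
        if match.group(1) is not None:
            tokens.append(("modified", match.group(1).strip()))
        else:
            tokens.append(("base", match.group(2).lower()))
        pos = match.end()
    return tokens


forward = tokenize_primer("gcagtt<i>caag<gal q>tggagtgaa")
print([abbr for kind, abbr in forward if kind == "modified"])
```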
1.4.4 Introduction of /mobile_element qualifier
For repeat_region features, the /transposon and /insertion_seq
qualifiers can be used to describe two specific classes of mobile
elements. But not all mobile elements fall into these two categories,
so a new structured /mobile_element qualifier will be introduced
as of GenBank 157.0 in December 2006. The preliminary description
of the new qualifier is as follows:
Description: Type, and name (or identifier), of the mobile element
which is described by the parent feature.
Value format: <mobile_element_type>:<mobile_element_id>
Where mobile element type is one of the following: transposon,
integron, insertion_sequence, other .
Further details about this new qualifier, the domain of mobile element
types in particular, will be provided in these release notes and via the
GenBank newsgroup as they become available.
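To illustrate the preliminary value format, the Python sketch below splits a /mobile_element value into its type and identifier. Because the qualifier definition is still preliminary, the type list may change; the example value "transposon:Tn5" and the function name are hypothetical, and the sketch assumes that the colon separator is always present.

```python
# Mobile element types from the preliminary description above.
MOBILE_ELEMENT_TYPES = {"transposon", "integron", "insertion_sequence", "other"}


def parse_mobile_element(value):
    """Split '<mobile_element_type>:<mobile_element_id>' into its two parts."""
    element_type, separator, element_id = value.partition(":")
    if not separator:
        raise ValueError("expected <mobile_element_type>:<mobile_element_id>")
    if element_type not in MOBILE_ELEMENT_TYPES:
        raise ValueError(f"unknown mobile element type: {element_type!r}")
    return element_type, element_id


print(parse_mobile_element("transposon:Tn5"))  # hypothetical example value
```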
1.4.5 New /mol_type value
A new legal molecule type value for viral cRNA sequences will be
introduced as of October 2006:
This value will also be legal for the molecule type field on the
LOCUS line of the GenBank flatfile format. Additional details about
the usage of this new molecule type value will be provided via
these release notes and the GenBank newsgroup.
1.4.6 Feature location syntax X.Y to be discontinued
The Feature Table currently supports feature locations of the
format X.Y, to represent a base position which is greater than or
equal to X, and less than or equal to Y. For example:
In the first example, the misc_feature starts somewhere between
bases 1 and 10 (inclusive), and ends at basepair 20. In the second,
the 51 bases from 100..150 are joined together with a second basepair
interval, which could be anywhere from 200..250 to 210..250 .
Although this syntax seems like a reasonable way to capture an
uncertain interval, it is used for features on a vanishingly small
number of sequence records, most database submission mechanisms
don't support it, and the meaning of its use in a join() context
is not entirely clear.
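Users who want to find out whether their own software depends on this syntax could test it against a small parser. The Python sketch below assumes the parenthesized form (X.Y)..Z, so that the first case described above would be written (1.10)..20; that spelling is an assumption reconstructed from the description, and the function names are illustrative.

```python
import re

# A position is either a plain integer N or an uncertain position (X.Y).
POSITION = re.compile(r"^\((\d+)\.(\d+)\)$|^(\d+)$")


def parse_position(text):
    """Return (low, high) bounds for a position written as N or (X.Y)."""
    match = POSITION.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported position: {text!r}")
    if match.group(3) is not None:
        exact = int(match.group(3))
        return (exact, exact)
    low, high = int(match.group(1)), int(match.group(2))
    if low > high:
        raise ValueError(f"X must not exceed Y in {text!r}")
    return (low, high)


def parse_span(text):
    """Parse 'start..end' where either end may be an X.Y position."""
    start, separator, end = text.partition("..")
    if not separator:
        raise ValueError(f"expected start..end, got {text!r}")
    return parse_position(start), parse_position(end)


print(parse_span("(1.10)..20"))
print(parse_span("(200.210)..250"))
```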
As of October 2006, this type of location will no longer be
supported. Those records with features which utilize X.Y locations
will be reviewed and converted to a non-uncertain format prior to
1.4.7 /operon to become legal for rRNA features
With the October 2006 GenBank release, the /operon qualifier will
be legal for use on rRNA features.