[Genbank-bb] GenBank Release 155.0 Now Available

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Fri Aug 25 17:45:38 EST 2006

Greetings GenBank Users,

  GenBank Release 155.0 is now available via FTP from the National
Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 155.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 155.0

  Close-of-data for GenBank 155.0 occurred on 08/21/2006. Uncompressed, the
Release 155.0 flatfiles require roughly 230 GB (sequence files only)
or 240 GB (including the 'short directory', 'index' and the *.txt files). 
The ASN.1 data require approximately 199 GB.

Recent statistics for non-WGS sequences:

  Release  Date       Base Pairs   Entries

  154      Jun 2006   63412609711  58890345
  155      Aug 2006   65369091950  61132599

And for WGS sequences:

  Release  Date        Base Pairs   Entries

  154      Jun 2006    78858635822  17733973
  155      Aug 2006    80369977826  17960667

  During the 73 days between the close dates for GenBank Releases 154.0
and 155.0, the non-WGS portion of GenBank grew by 1,956,482,239 basepairs
and by 2,242,254 sequence records. During that same period, 2,992,012 records
were updated. An average of about 71,700 non-WGS records were added and/or
updated per day.

  Between releases 154.0 and 155.0, the WGS component of GenBank grew by
1,511,342,004 basepairs and by 226,694 sequence records.
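
  The growth figures above can be reproduced from the two statistics
tables. The following is a minimal Python sketch; the variable and
function names are illustrative only, and the numbers are copied from
this announcement.

```python
# Figures copied from the statistics tables in this announcement:
# release number -> (base pairs, entries)
NON_WGS = {154: (63412609711, 58890345), 155: (65369091950, 61132599)}
WGS = {154: (78858635822, 17733973), 155: (80369977826, 17960667)}

DAYS_BETWEEN_CLOSE_DATES = 73   # stated in the text above
RECORDS_UPDATED = 2992012       # non-WGS records updated in that period


def growth(table, old=154, new=155):
    """Return (base pairs added, records added) between two releases."""
    old_bp, old_entries = table[old]
    new_bp, new_entries = table[new]
    return new_bp - old_bp, new_entries - old_entries


non_wgs_bp, non_wgs_records = growth(NON_WGS)
wgs_bp, wgs_records = growth(WGS)

# Records added plus records updated, averaged over the 73 days.
daily_average = (non_wgs_records + RECORDS_UPDATED) / DAYS_BETWEEN_CLOSE_DATES

print(f"non-WGS: +{non_wgs_bp:,} bp, +{non_wgs_records:,} records")
print(f"WGS:     +{wgs_bp:,} bp, +{wgs_records:,} records")
print(f"added and/or updated per day: {daily_average:,.0f}")
```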

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 155.0 and Upcoming Changes) have been appended below.

		** Important Note #1 **

  A new protein residue abbreviation for the 22nd naturally occurring
amino acid, pyrrolysine, will become legal in GenBank protein sequences
as of October 2006 (Release 156.0). Please see Section 1.4.1 of the release
notes for further information.

		** Important Note #2 **

  After recent problems generating the 'index' files which normally
accompany GenBank Releases, these files are once again being provided,
though without any EST content, and without most GSS content. See Section
1.3.3 for further details. NCBI is considering ceasing support for the
index files, so we strongly encourage affected users to review that section
and provide feedback.

  Release 155.0 data, and subsequent updates, are available now via
NCBI's Entrez and BLAST services.

  As a general guideline, we suggest first transferring the GenBank release
notes (gbrel.txt) whenever a release is being obtained. Check to make sure
that the date and release number in the header of the release notes are
current (e.g., August 15 2006, 155.0). If they are not, interrupt the
remaining transfers and then request assistance from the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a unix platform with csh/tcsh :

	set files = `ls gb*.*`
	foreach i ($files)
		head -10 $i | grep Release
	end

Or, if the files are compressed, perhaps:

	gzcat $i | head -10 | grep Release
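
  For sites without csh/tcsh, the same check might be scripted in Python.
This is only a sketch: the function names, the default "gb*.*" pattern
and the assumption that compressed files end in ".gz" are ours, not part
of any NCBI tool.

```python
import glob
import gzip


def release_lines(path, max_lines=10):
    """Return lines among the first max_lines of a file that mention 'Release'."""
    # Assumption: compressed release files carry a .gz suffix.
    opener = gzip.open if path.endswith(".gz") else open
    found = []
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        for number, line in enumerate(handle):
            if number >= max_lines:
                break
            if "Release" in line:
                found.append(line.strip())
    return found


def check_release(pattern="gb*.*", expected="155.0"):
    """Map each matching file to True if its header names the expected release."""
    return {
        path: any(expected in line for line in release_lines(path))
        for path in sorted(glob.glob(pattern))
    }
```

Calling check_release() in the directory that holds the transferred
files returns a dictionary; any file mapped to False deserves a closer
look.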

  If you encounter problems while ftp'ing or uncompressing Release
155.0, please send email outlining your difficulties to:

	info at ncbi.nlm.nih.gov

Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman

1.3 Important Changes in Release 155.0

1.3.1 Organizational changes

  The total number of sequence data files increased by 31 with this release:

  - the BCT division is now comprised of  16 files (+1)
  - the EST division is now comprised of 546 files (+18)
  - the ENV division is now comprised of   4 files (+1)
  - the GSS division is now comprised of 181 files (+4)
  - the HTC division is now comprised of  12 files (+2)
  - the HTG division is now comprised of  84 files (+1)
  - the MAM division is now comprised of   3 files (+1)
  - the PAT division is now comprised of  25 files (+1)
  - the PRI division is now comprised of  31 files (+1)
  - the ROD division is now comprised of  25 files (+1)

  Reminder: the Short-Directory 'index' file is now split into
  three pieces, as of GB 154 :

  gbsdr1.txt : non-EST and non-GSS short directory entries
  gbsdr2.txt : EST short directory entries
  gbsdr3.txt : GSS short directory entries

1.3.2 Index files gbjou.idx and gbkey.idx not available.

  Problems were encountered generating the journal and keyword 'index' files,
in spite of the recent changes which limit their content to non-EST and non-GSS
records (see Section 1.3.3 for a description of those changes).

  Because Release 155.0 was already late due to other (unrelated) issues, we
are making this release available without gbjou.idx and gbkey.idx . If possible,
we will provide them within a few days of 155.0's availability, and let users
know via the GenBank listserv.

1.3.3 Changes in the content of index files

  As described in the GB 153 release notes, the 'index' files which accompany
GenBank releases (see Section 3.3) are considered to be a legacy data product by
NCBI, generated mostly for historical reasons. FTP statistics since January 2005
seem to support this: the index files are transferred only half as frequently as
the files of sequence records. The inherent inefficiencies of the index file
format also lead us to suspect that they have little serious use by the user
community, particularly for EST and GSS records.

  The software that generated the index file products received little
attention over the years, and finally reached its limitations in
February 2006 (Release 152.0). The required multi-server queries which
obtained and sorted many millions of rows of terms from several different
databases simply outgrew the capacity of the hardware used for GenBank
Release generation.

  Our short-term solution is to cease generating index-file content
for all EST sequence records, and for GSS sequence records that originate
via direct submission to NCBI. GenBank 155.0 thus contains these ten index
files, which lack all EST and most GSS content:


  In addition, a version of gbacc.idx which encompasses the entirety of the
release was built manually, but note that the first field contains just an
accession number, rather than Accession.Version, and that the file is unsorted.

  These 'solutions' are really just stop-gaps, and we will likely pursue
one of two options within the next year:

a) Cease support of the 'index' file products altogether.

b) Provide new products that present some of the most useful data from
   the legacy 'index' files, and cease support for other types of index data.

  If you are a user of the 'index' files associated with GenBank files, we
encourage you to make your wishes known, either via the GenBank newsgroup,
or via email to NCBI's Service Desk:

   info at ncbi.nlm.nih.gov

  Our apologies for any inconvenience that these changes may cause.

1.3.4 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped from the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for
thirty-three of the GSS flatfiles in Release 155.0. Consider gbgss149.seq :

GBGSS1.SEQ           Genetic Sequence Data Bank
                           August 15 2006

                NCBI-GenBank Flat File Release 155.0

                           GSS Sequences (Part 1)

   86836 loci,    64404216 bases, from    86836 reported sequences

  Here, the header records the filename as GBGSS1.SEQ and the part number
as "1", though the file has been renamed as "149" based on the number of
files dumped from the other system.  We will work to resolve this
discrepancy in future releases, but the priority is certainly much lower
than many other tasks.
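
  Software that trusts the filename printed in the header may want to
detect this mismatch. A minimal Python sketch follows, assuming the
first whitespace-delimited token of the first header line is the
filename (as in the gbgss149.seq example above); the function name is
hypothetical.

```python
import os


def header_name_matches(path, header_first_line):
    """True when the filename printed in the header matches the name on disk."""
    tokens = header_first_line.split()
    header_name = tokens[0].lower() if tokens else ""
    disk_name = os.path.basename(path).lower()
    # Assumption: a compressed copy only adds a .gz suffix to the name.
    if disk_name.endswith(".gz"):
        disk_name = disk_name[:-3]
    return header_name == disk_name


header = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_name_matches("gbgss149.seq", header))  # the renamed file
print(header_name_matches("gbgss1.seq", header))    # a file whose name agrees
```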

1.4 Upcoming Changes

1.4.1 New protein residue abbreviation for Pyrrolysine

  Sequence databases use single-letter amino acid abbreviations to
record the primary structure (sequence) of amino acids in a polypeptide.
The table of abbreviations includes only those amino acids that are
encoded in the genetic code and directly inserted by a tRNA during the
process of protein translation.  Post-translational modifications are
not represented in the sequence data itself, but may be described by
features annotated on the sequence.

  The discovery of the 22nd naturally encoded amino acid, pyrrolysine,
and the recent submission of sequence records that should contain
this residue, require the adoption of a new amino acid abbreviation.
Because several letters are assigned to represent different experimental
ambiguities, the only letter still available for use is O (uppercase
letter o).  Scientists working in the field have independently suggested
use of this letter, and it has a reasonable mnemonic, pyrrOlysine.

  The IUPAC-IUBMB Joint Commission on Biochemical Nomenclature has agreed
that Pyl/O will be recommended for this amino acid.

  The consequences for flatfile users are that O will appear in CDS
/translation qualifiers, and that Pyl (the three-letter abbreviation)
will appear in CDS /transl_except qualifiers and in the /product and
/anticodon qualifiers of tRNA features. These changes will take effect
as of the October 2006 GenBank release.

  Sample records in ASN.1, FASTA, GenBank flatfile, and INSDSeq XML
formats will be made available on the NCBI ftp site for the purpose of
testing software prior to the public introduction of 'O' in protein
sequences.

  For BLAST and other sequence similarity search tools, we expect to map
pyrrolysine (O) to unknown (X), as is already done with selenocysteine
(U), the 21st naturally encoded amino acid.  One reason is that the PAM
and BLOSUM substitution matrices do not accommodate these more recently
discovered amino acids.  The other reason is that selenocysteine and
pyrrolysine both appear to be used as active sites in certain enzymes,
and thus do not simply substitute for cysteine or lysine.

  Here are a few literature references which provide more information
about pyrrolysine :

  G. Srinivasan, C. M. James, J. A. Krzycki.  Pyrrolysine encoded by
  UAG in Archaea: charging of a UAG-decoding specialized tRNA.  Science
  2002, 296:1459-1462.

  B. Hao, W. Gong, T.K. Ferguson, C.M. James, J.A. Krzycki, M.K.
  Chan.  A new UAG-encoded residue in the structure of a methanogen
  methyltransferase.  Science 2002, 296:1462-1466.

  C. Polycarpo, A. Ambrogelly, A. Berube, S.M. Winbush, J.A.
  McCloskey, P. F. Crain, J. L. Wood, D. Soll.  An aminoacyl-tRNA
  synthetase that specifically activates pyrrolysine.  Proc. Natl. Acad.
  Sci. (USA) 2004, 101:12450-12454.

  C. Fenske, G.J. Palm, W. Hinrichs.  How unique is the genetic code?
  Angew. Chem. Int. Ed. 2003, 42:606-610.

1.4.2 Protein residue J for leucine/isoleucine ambiguities

  The residue abbreviation J is reserved for mass spectrometry experiments that
cannot distinguish leucine from isoleucine. Although this abbreviation has
been part of the IUPAC recommendations for some time, it has not previously
appeared in protein sequences in the GenBank database.

  As of October 2006, abbreviation J will be legal in CDS /translation
qualifiers, and Xle (the three-letter abbreviation) will be allowed in CDS
/transl_except qualifiers and in the /product and /anticodon qualifiers of
tRNA features.

  J will also be mapped to unknown (X) for the purpose of BLAST and other
sequence similarity search tools.
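
  Local tools that need to mirror the mapping described in Sections 1.4.1
and 1.4.2 could apply it before a similarity search. The sketch below is
an illustration in Python, not the BLAST implementation; the function
name and the handling of lowercase residues are assumptions.

```python
# Residues mapped to unknown (X): selenocysteine (U), pyrrolysine (O)
# and the leucine/isoleucine ambiguity code (J).
# Assumption: lowercase residues are mapped to lowercase x.
TO_UNKNOWN = str.maketrans("UOJuoj", "XXXxxx")


def mask_for_search(protein):
    """Return the protein sequence with U, O and J replaced by X."""
    return protein.translate(TO_UNKNOWN)


print(mask_for_search("MKUOAJL"))
```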

1.4.3 /PCR_primers and modified bases

  PCR primers are sometimes constructed which utilize modified bases,
such as those listed in the table of modified bases included in the
Feature Table document:


In October 2006, it will be legal to use modified-base abbreviations
for the /PCR_primers qualifier. For example:

         /PCR_primers="fwd_seq: gcagtt<i>caag<gal q>tggagtgaa, rev_seq:

Here, modified bases inosine and beta,D-galactosylqueosine are included
in the forward sequence of the primer pair, and enclosed between angle
brackets ( <...> ) .

Each pair of angle brackets will include only a single modified base.
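
Parsers that treat a primer sequence as a plain string of bases will
need to recognize the bracketed abbreviations. One possible approach is
sketched below in Python, using the forward sequence from the example
above; the function name and the choice to reject unbalanced brackets
are assumptions.

```python
import re

# Either one bracketed modified-base abbreviation or one ordinary base letter.
TOKEN = re.compile(r"<([^<>]+)>|([A-Za-z])")


def tokenize_primer(seq):
    """Split a primer into bases; modified bases keep their angle brackets."""
    tokens = []
    position = 0
    for match in TOKEN.finditer(seq):
        if match.start() != position:
            raise ValueError(f"unexpected text at offset {position}")
        tokens.append(match.group(0))
        position = match.end()
    if position != len(seq):
        raise ValueError(f"unexpected text at offset {position}")
    return tokens


tokens = tokenize_primer("gcagtt<i>caag<gal q>tggagtgaa")
modified = [token[1:-1] for token in tokens if token.startswith("<")]
print(len(tokens), modified)
```

Counting each bracketed abbreviation as one token keeps the primer
length equal to the number of bases, modified or not.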

1.4.4 Introduction of /mobile_element qualifier

  For repeat_region features, the /transposon and /insertion_seq
qualifiers can be used to describe two specific classes of mobile
elements. But not all mobile elements fall into these two categories,
so a new structured /mobile_element qualifier will be introduced
as of GenBank 157.0 in December 2006. The preliminary description
of the new qualifier is as follows:

  Qualifier: /mobile_element

  Description: Type, and name (or identifier), of the mobile element
  which is described by the parent feature.

  Value format: <mobile_element_type>:<mobile_element_id>
  Where mobile element type is one of the following: transposon,
  integron, insertion_sequence, other .

  Example: /mobile_element="transposon:Tnp9"

  Further details about this new qualifier, the domain of mobile element
types in particular, will be provided in these release notes and via the
GenBank newsgroup as they become available.
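
  Based on the preliminary value format above, a flatfile parser might
split and validate the qualifier value as sketched here in Python. The
function name is hypothetical, and requiring a non-empty identifier is
our assumption; the list of types may change before the qualifier is
introduced.

```python
# Types listed in the preliminary description; subject to change.
MOBILE_ELEMENT_TYPES = {"transposon", "integron", "insertion_sequence", "other"}


def parse_mobile_element(value):
    """Split '<mobile_element_type>:<mobile_element_id>' into its two parts."""
    element_type, separator, element_id = value.partition(":")
    if not separator:
        raise ValueError("missing ':' between type and identifier")
    if element_type not in MOBILE_ELEMENT_TYPES:
        raise ValueError(f"unknown mobile element type: {element_type!r}")
    # Assumption: an empty identifier is not allowed.
    if not element_id:
        raise ValueError("missing mobile element identifier")
    return element_type, element_id


print(parse_mobile_element("transposon:Tnp9"))
```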

1.4.5 New /mol_type value

  A new legal molecule type value for viral cRNA sequences will be
introduced as of October 2006:

	/mol_type="viral cRNA"

  This value will also be legal for the molecule type field on the
LOCUS line of the GenBank flatfile format. Additional details about
the usage of this new molecule type value will be provided via
these release notes and the GenBank newsgroup.

1.4.6 Feature location syntax X.Y to be discontinued

  The Feature Table currently supports feature locations of the
format X.Y, to represent a base position which is greater than or
equal to X, and less than or equal to Y. For example:

	misc_feature    1.10..20
	misc_feature    join(100..150,200.210..250)

  In the first example, the misc_feature starts somewhere between
bases 1 and 10 (inclusive), and ends at basepair 20. In the second,
the 51 bases from 100..150 are joined together with a second basepair
interval, which could be anywhere from 200..250 to 210..250 .
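
  To find out whether local records use this syntax, location strings
can be scanned for a single dot between two numbers. The Python sketch
below is one way to do that; the regular expression and function name
are ours, and the pattern deliberately skips Accession.Version prefixes
such as those used in remote locations.

```python
import re

# A number, one dot, a number: not preceded by a word character or dot
# (so '..' ranges and accession.version prefixes are skipped) and not
# followed by a word character or ':'.
UNCERTAIN = re.compile(r"(?<![\w.])(\d+)\.(\d+)(?![\w:])")


def uncertain_positions(location):
    """Return (X, Y) pairs for every X.Y position in a feature location."""
    return [(int(x), int(y)) for x, y in UNCERTAIN.findall(location)]


for location in ("1.10..20", "join(100..150,200.210..250)", "100..150"):
    print(location, uncertain_positions(location))
```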

  Although this syntax seems like a reasonable way to capture an
uncertain interval, it is used for features on a vanishingly small
number of sequence records, most database submission mechanisms
don't support it, and the meaning of its use in a join() context
is not entirely clear.

  As of October 2006, this type of location will no longer be 
supported. Those records with features which utilize X.Y locations
will be reviewed and converted to a non-uncertain format prior to
that date.

1.4.7 /operon to become legal for rRNA features

  With the October 2006 GenBank release, the /operon qualifier will
be legal for use on rRNA features.
