[Genbank-bb] GenBank Release 156.0 Now Available

Mark Cavanaugh via genbankb%40net.bio.net (by cavanaug from ncbi.nlm.nih.gov)
Tue Oct 17 21:18:46 EST 2006


Greetings GenBank Users,

  GenBank Release 156.0 is now available via FTP from the National
Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 156.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 156.0

  Close-of-data for GenBank 156.0 occurred on 10/12/2006. Uncompressed, the
Release 156.0 flatfiles require roughly 235 GB (sequence files only)
or 245 GB (including the 'short directory', 'index' and the *.txt files). 
The ASN.1 data require approximately 204 GB.

Recent statistics for non-WGS sequences:

  Release  Date       Base Pairs   Entries

  155      Aug 2006   65369091950  61132599
  156      Oct 2006   66925938907  62765195

And for WGS sequences:

  Release  Date        Base Pairs   Entries

  155      Aug 2006    80369977826  17960667
  156      Oct 2006    81127502509  18500772

  During the 51 days between the close dates for GenBank Releases 155.0
and 156.0, the non-WGS portion of GenBank grew by 1,556,846,957 basepairs
and by 1,632,596 sequence records. During that same period, 1,191,497 records
were updated. An average of about 55,400 non-WGS records were added and/or
updated per day.

  Between releases 155.0 and 156.0, the WGS component of GenBank grew by
757,524,683 basepairs and by 540,105 sequence records.
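
  The growth figures quoted above can be reproduced from the two
statistics tables. The following Python sketch is only an illustration:
the totals are copied from the tables, and the names used (NON_WGS, WGS,
growth) are hypothetical rather than part of any NCBI tool.

```python
# Non-WGS totals copied from the statistics table: (base pairs, entries).
NON_WGS = {155: (65369091950, 61132599), 156: (66925938907, 62765195)}
# WGS totals copied from the second table.
WGS = {155: (80369977826, 17960667), 156: (81127502509, 18500772)}

DAYS_BETWEEN_CLOSE_DATES = 51
RECORDS_UPDATED = 1191497


def growth(table, old=155, new=156):
    """Return (base-pair growth, record growth) between two releases."""
    old_bp, old_n = table[old]
    new_bp, new_n = table[new]
    return new_bp - old_bp, new_n - old_n


non_wgs_bp, non_wgs_records = growth(NON_WGS)
wgs_bp, wgs_records = growth(WGS)
# Records added plus records updated, averaged over the 51-day interval.
per_day = (non_wgs_records + RECORDS_UPDATED) / DAYS_BETWEEN_CLOSE_DATES

print(f"non-WGS: +{non_wgs_bp:,} bp, +{non_wgs_records:,} records")
print(f"WGS:     +{wgs_bp:,} bp, +{wgs_records:,} records")
print(f"added/updated per day: {per_day:,.0f}")
```

  The per-day value comes out slightly below the rounded figure of about
55,400 given in the text.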

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 156.0 and Upcoming Changes) have been appended
below.

		** Important Note #1 **

  A new protein residue abbreviation for the 22nd naturally occurring
amino acid, pyrrolysine, becomes legal for GenBank protein sequences
as of this GenBank Release (Release 156.0). Please see Section 1.3.4 of
the release notes for further information.

  Sample ASN.1, FASTA, GenBank flatfile, and INSDSeq XML files for CP000099,
which has a protein with a pyrrolysine residue, are available for testing
purposes at the NCBI FTP site:

	ftp://ftp.ncbi.nih.gov/genbank/Pyrrolysine_Samples

	Files:

	CP000099.pse    (print-form ASN.1 Seq-entry)
	CP000099.gbff   (GenBank flatfile)
	CP000099.aa_fsa (protein FASTA)
	CP000099.isx    (INSDSeq XML)

		** Important Note #2 **

  After recent problems generating the 'index' files which normally
accompany GenBank Releases, these files are once again being provided,
though without any EST content, and without most GSS content. See Sections
1.3.2 and 1.3.3 for further details. NCBI is considering ceasing support for
the index files, so we strongly encourage affected users to review these
sections and provide feedback.

  Release 156.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  As a general guideline, we suggest first transferring the GenBank release
notes (gbrel.txt) whenever a release is being obtained. Check to make sure
that the date and release number in the header of the release notes are
current (e.g., October 15 2006, 156.0). If they are not, interrupt the
remaining transfers and then request assistance from the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix platform with csh/tcsh:

	set files = `ls gb*.*`
	foreach i ($files)
		head -10 $i | grep Release
	end

Or, if the files are compressed, perhaps:

	gzcat $i | head -10 | grep Release
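
  For sites without csh, the same check might be sketched in Python. This
is an illustration only: the function names are hypothetical, and it
assumes that compressed files carry a .gz suffix and that the expected
release string is "Release 156.0".

```python
import glob
import gzip


def release_lines(path, max_lines=10):
    """Return lines mentioning 'Release' among the first max_lines of a file."""
    # Assumption: compressed files are gzip files with a .gz suffix.
    opener = gzip.open if path.endswith(".gz") else open
    found = []
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        for number, line in enumerate(handle):
            if number >= max_lines:
                break
            if "Release" in line:
                found.append(line.strip())
    return found


def check_release(pattern="gb*.*", expected="Release 156.0"):
    """Map each matching file to True if its header names the expected release."""
    return {
        path: any(expected in line for line in release_lines(path))
        for path in sorted(glob.glob(pattern))
    }


if __name__ == "__main__":
    for path, ok in check_release().items():
        print(("ok      " if ok else "MISMATCH"), path)
```

  Like the csh loop, this reads only the first ten lines of each file, so
it stays fast even on the largest flatfiles.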

  If you encounter problems while ftp'ing or uncompressing Release
156.0, please send email outlining your difficulties to:

	info from ncbi.nlm.nih.gov

Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman
GenBank
NCBI/NLM/NIH/HHS


1.3 Important Changes in Release 156.0

1.3.1 Organizational changes

  The total number of sequence data files increased by 23 with this release:

  - the BCT division is now comprised of  17 files (+1)
  - the EST division is now comprised of 554 files (+8)
  - the GSS division is now comprised of 190 files (+9)
  - the HTG division is now comprised of  86 files (+2)
  - the PAT division is now comprised of  26 files (+1)
  - the PLN division is now comprised of  19 files (+1)
  - the VRT division is now comprised of  12 files (+1)

1.3.2 Index files gbjou.idx and gbkey.idx not available.

  Problems were once again encountered generating the journal and keyword
'index' files, in spite of the recent changes which limit their content to
non-EST and non-GSS records (see Section 1.3.3 for a description of those
changes).

  Another attempt to resolve this issue will be made prior to GenBank 157.0.
Our apologies for any inconvenience that this may cause.

1.3.3 Changes in the content of index files

  As described in the GB 153 release notes, the 'index' files which accompany
GenBank releases (see Section 3.3) are considered to be a legacy data product by
NCBI, generated mostly for historical reasons. FTP statistics since January 2005
seem to support this: the index files are transferred only half as frequently as
the files of sequence records. The inherent inefficiencies of the index file
format also lead us to suspect that they have little serious use by the user
community, particularly for EST and GSS records.

  The software that generated the index file products received little
attention over the years, and finally reached its limitations in
February 2006 (Release 152.0). The required multi-server queries which
obtained and sorted many millions of rows of terms from several different
databases simply outgrew the capacity of the hardware used for GenBank
Release generation.

  Our short-term solution is to cease generating some index-file content
for all EST sequence records, and for GSS sequence records that originate
via direct submission to NCBI. GenBank 156.0 thus contains these index
files, which lack all EST and most GSS content:

	gbaut1.idx
	gbaut2.idx
	gbaut3.idx
	gbaut4.idx
	gbaut5.idx
	gbaut6.idx
	gbaut7.idx
	gbaut8.idx
	gbgen.idx
	gbsec.idx

  We intend to provide similarly-restricted gbjou.idx and gbkey.idx index
files, but could not do so for this release. 

  The gbacc.idx index file continues to reflect the entirety of the release,
including all EST and GSS records; however, the file's contents are unsorted.

  On a positive note, sequence version numbers have been restored to
the accession number 'index' file as of GenBank 156.0.

  These 'solutions' are really just stop-gaps, and we will likely pursue
one of two options within the next year:

a) Cease support of the 'index' file products altogether.

b) Provide new products that present some of the most useful data from
   the legacy 'index' files, and cease support for other types of index data.

  If you are a user of the 'index' files associated with GenBank files, we
encourage you to make your wishes known, either via the GenBank newsgroup,
or via email to NCBI's Service Desk:

   info from ncbi.nlm.nih.gov

  Our apologies for any inconvenience that these changes may cause.

1.3.4 New protein residue abbreviation for Pyrrolysine

  Sequence databases use single-letter amino acid abbreviations to
record the primary structure (sequence) of amino acids in a polypeptide.
The table of abbreviations includes only those amino acids that are
encoded in the genetic code and directly inserted by a tRNA during the
process of protein translation.  Post-translational modifications are
not represented in the sequence data itself, but may be described by
features annotated on the sequence.

  The discovery of the 22nd naturally encoded amino acid, pyrrolysine,
and the recent submission of sequence records that should contain
this residue, require the adoption of a new amino acid abbreviation.
Because several letters are assigned to represent different experimental
ambiguities, the only letter still available for use is O (uppercase
letter o).  Scientists working in the field have independently suggested
use of this letter, and it has a reasonable mnemonic, pyrrOlysine.

  The IUPAC-IUBMB Joint Commission on Biochemical Nomenclature has agreed
that Pyl/O will be recommended for this amino acid.

  The consequences for flatfile users are that O can now appear in CDS
/translation qualifiers, and that Pyl (the three-letter abbreviation)
can appear in CDS /transl_except qualifiers and in the /product and
/anticodon qualifiers of tRNA features. These changes are legal as of this
October 2006 GenBank Release.

  Sample ASN.1, FASTA, GenBank flatfile, and INSDSeq XML files for CP000099,
which has a protein with a pyrrolysine residue, are available for testing
purposes at the NCBI FTP site:

	ftp://ftp.ncbi.nih.gov/genbank/Pyrrolysine_Samples

	Files:

	CP000099.pse    (print-form ASN.1 Seq-entry)
	CP000099.gbff   (GenBank flatfile)
	CP000099.aa_fsa (protein FASTA)
	CP000099.isx    (INSDSeq XML)

  For BLAST and other sequence similarity search tools, we expect to map
pyrrolysine (O) to unknown (X), as is already done with selenocysteine
(U), the 21st naturally encoded amino acid.  One reason is that the PAM
and BLOSUM substitution matrices do not accommodate these more recently
discovered amino acids.  The other reason is that selenocysteine and
pyrrolysine both appear to be used as active sites in certain enzymes,
and thus do not simply substitute for cysteine or lysine.
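
  The mapping described above could be sketched as follows in Python.
This is a minimal illustration, not the way BLAST itself is implemented;
the function name and the example sequence are made up.

```python
# Residues replaced by the unknown symbol X before similarity searching,
# following the text above: selenocysteine (U) and pyrrolysine (O).
TO_UNKNOWN = str.maketrans({"U": "X", "O": "X", "u": "x", "o": "x"})


def mask_rare_residues(protein):
    """Replace U and O with X, leaving every other residue unchanged."""
    return protein.translate(TO_UNKNOWN)


# Hypothetical peptide containing one pyrrolysine and one selenocysteine.
print(mask_rare_residues("MKOLUA"))
```

  Because the replacement is one letter for one letter, residue positions
in the masked sequence still line up with the original record.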

  Here are a few literature references which provide more information
about pyrrolysine:

  G. Srinivasan, C. M. James, J. A. Krzycki.  Pyrrolysine encoded by
  UAG in Archaea: charging of a UAG-decoding specialized tRNA.  Science
  2002, 296:1459-1462.

  B. Hao, W. Gong, T.K. Ferguson, C.M. James, J.A. Krzycki, M.K.
  Chan.  A new UAG-encoded residue in the structure of a methanogen
  methyltransferase.  Science 2002, 296:1462-1466.

  C. Polycarpo, A. Ambrogelly, A. Berube, S.M. Winbush, J.A.
  McCloskey, P. F. Crain, J. L. Wood, D. Soll.  An aminoacyl-tRNA
  synthetase that specifically activates pyrrolysine.  Proc. Natl. Acad.
  Sci. (USA) 2004, 101:12450-12454.

  C. Fenske, G.J. Palm, W. Hinrichs.  How unique is the genetic code?
  Angew. Chem. Int. Ed. 2003, 42:606-610.

1.3.5 Protein residue J for leucine/isoleucine ambiguities

  The residue abbreviation J is reserved for mass spectrometry experiments that
cannot distinguish leucine from isoleucine. Although this abbreviation has
been part of the IUPAC recommendations for some time, it has not previously
appeared in protein sequences in the GenBank database.

  As of October 2006, abbreviation J is legal in CDS /translation qualifiers,
and Xle (the three-letter abbreviation) will be allowed in CDS /transl_except
qualifiers and in the /product and /anticodon qualifiers of tRNA features.

  J will also be mapped to unknown (X) for the purpose of BLAST and other
sequence similarity search tools.
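
  Software that validates /translation strings may need its residue
alphabet extended for O and J. The Python sketch below shows one way to
do that. The letter sets are an assumption based on the IUPAC one-letter
codes, not an authoritative list of what GenBank accepts, and the
function name is hypothetical.

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")   # the 20 standard amino acids
AMBIGUOUS = set("BZXJ")                  # Asx, Glx, unknown, Xle (Leu/Ile)
RARE = set("UO")                         # selenocysteine, pyrrolysine
LEGAL = STANDARD | AMBIGUOUS | RARE      # assumed alphabet, see lead-in


def illegal_residues(translation):
    """Return (1-based position, letter) pairs for letters outside LEGAL."""
    return [
        (pos, letter)
        for pos, letter in enumerate(translation.upper(), start=1)
        if letter not in LEGAL
    ]


# Hypothetical sequences: the first uses J, O and U; the second has bad symbols.
print(illegal_residues("MKJLOU"))
print(illegal_residues("MK1J*"))
```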

1.3.6 /PCR_primers and modified bases

  PCR primers are sometimes constructed which utilize modified bases,
such as those listed in the table of modified bases included in the
Feature Table document:

	http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html#7.5.2

As of October 2006, it is legal to use modified-base abbreviations for the
/PCR_primers qualifier. For example:

         /PCR_primers="fwd_seq: gcagtt<i>caag<gal q>tggagtgaa, rev_seq:
         gcaacgtatcctccagagtgatcgb

Here, modified bases inosine and beta,D-galactosylqueosine are included
in the forward sequence of the primer pair, and enclosed between angle
brackets (<...>).

Each pair of angle brackets will include only a single modified base
abbreviation.
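
  Parsers that treat primer sequences as plain strings of base letters
will need to recognise the bracketed abbreviations. A minimal Python
sketch, assuming that every unmodified base is a single letter and that
abbreviations never contain angle brackets themselves (the function name
is hypothetical):

```python
import re

# A token is either one modified-base abbreviation in angle brackets
# or a single unmodified base letter.
TOKEN = re.compile(r"<([^<>]+)>|([A-Za-z])")


def split_primer(seq):
    """Split a primer sequence into per-position tokens.

    Modified bases are returned with their brackets (e.g. '<i>'),
    ordinary bases as single letters.
    """
    tokens = []
    pos = 0
    while pos < len(seq):
        match = TOKEN.match(seq, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}: {seq[pos]!r}")
        tokens.append(match.group(0))
        pos = match.end()
    return tokens


# Forward primer from the example above.
tokens = split_primer("gcagtt<i>caag<gal q>tggagtgaa")
modified = [t for t in tokens if t.startswith("<")]
print(len(tokens), modified)
```

  Counting tokens rather than characters gives the primer length in
bases, since each bracketed abbreviation stands for exactly one base.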

1.3.7 New /mol_type value

  A new legal molecule type value for viral cRNA sequences becomes
valid as of this October 2006 release:

	/mol_type="viral cRNA"

  This value will also be legal for the molecule type field on the
LOCUS line of the GenBank flatfile format. Additional details about
the usage of this new molecule type value will be provided via
these release notes and the GenBank newsgroup.

1.3.8 Feature location syntax X.Y no longer supported

  The Feature Table has supported feature locations of the form
'X.Y', to represent a base position which is greater than or equal to X,
and less than or equal to Y. For example:

	misc_feature    1.10..20
	misc_feature    join(100..150,200.210..250)

  In the first example, the misc_feature starts somewhere between
bases 1 and 10 (inclusive), and ends at basepair 20. In the second,
the 51 bases from 100..150 are joined together with a second basepair
interval, which could be anywhere from 200..250 to 210..250.

  Although this syntax seems like a reasonable way to capture an
uncertain interval, it is used for features on a vanishingly small
number of sequence records, most database submission mechanisms
don't support it, and the meaning of its use in a join() context
is not entirely clear.

  As of October 2006, this type of location is no longer supported.
Those records with features which utilize X.Y locations will be reviewed
and converted to a non-uncertain format.
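
  Users who want to check whether their own parsers or local records
depend on this syntax could scan location strings for it. The following
Python sketch is one possible approach; the regular expression and the
function name are illustrative and are not taken from the Feature Table
definition.

```python
import re

# Digits, a single dot, digits. The lookbehind and the word boundaries keep
# the ordinary range operator '..' from being mistaken for an X.Y position.
UNCERTAIN = re.compile(r"(?<!\.)\b(\d+)\.(\d+)\b")


def uncertain_positions(location):
    """Return (X, Y) pairs for every X.Y position in a feature location."""
    return [(int(x), int(y)) for x, y in UNCERTAIN.findall(location)]


# The two examples from the text, plus an ordinary range for comparison.
for loc in ("1.10..20", "join(100..150,200.210..250)", "100..150"):
    print(loc, uncertain_positions(loc))
```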

1.3.9 /operon on rRNA features

  With this October 2006 GenBank release, the /operon qualifier may be
used for rRNA features.

1.3.10 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped from the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for
thirty-five of the GSS flatfiles in Release 156.0. Consider gbgss156.seq:

GBGSS1.SEQ           Genetic Sequence Data Bank
                          October 15 2006

                NCBI-GenBank Flat File Release 156.0

                           GSS Sequences (Part 1)

   86835 loci,    64398688 bases, from    86835 reported sequences

  Here, the filename and part number in the header are "1", though the file
has been renamed as "156" based on the number of files dumped from the other
system.  We will work to resolve this discrepancy in future releases, but the
priority is certainly much lower than many other tasks.
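
  Software that cross-checks filenames against file headers can detect
the affected files rather than fail on them. A minimal Python sketch,
assuming the filename pattern gbgssN.seq and a header whose first line
begins with GBGSSN.SEQ as in the example above (the function name is
hypothetical):

```python
import re


def part_numbers(filename, header_first_line):
    """Return (number in filename, number in header) for a GSS flatfile."""
    in_name = re.fullmatch(r"gbgss(\d+)\.seq", filename)
    in_header = re.match(r"GBGSS(\d+)\.SEQ\b", header_first_line)
    if in_name is None or in_header is None:
        raise ValueError("not a recognised GSS filename/header pair")
    return int(in_name.group(1)), int(in_header.group(1))


# The file discussed above: named 156, but headed as part 1.
name_no, header_no = part_numbers(
    "gbgss156.seq", "GBGSS1.SEQ           Genetic Sequence Data Bank"
)
print(name_no, header_no, "mismatch" if name_no != header_no else "consistent")
```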

1.4 Upcoming Changes

1.4.1 Introduction of /mobile_element qualifier

  For repeat_region features, the /transposon and /insertion_seq
qualifiers can be used to describe two specific classes of mobile
elements. But not all mobile elements fall into these two categories,
so a new structured /mobile_element qualifier will be introduced
as of GenBank 157.0 in December 2006. The preliminary description
of the new qualifier is as follows:

  Qualifier: /mobile_element

  Description: Type, and name (or identifier), of the mobile element
  which is described by the parent feature.

  Value format: <mobile_element_type>:<mobile_element_id>
  Where mobile element type is one of the following: transposon,
  integron, insertion_sequence, other.

  Example: /mobile_element="transposon:Tnp9"

  Further details about this new qualifier, the domain of mobile element
types in particular, will be provided in these release notes and via the
GenBank newsgroup as they become available.
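
  Based on the preliminary description above, a parser for the new
qualifier value might look like the following Python sketch. The list of
types may still change, and treating the identifier as mandatory is an
assumption made here for illustration; the function name is hypothetical.

```python
# Preliminary type list from the description above; subject to change.
MOBILE_ELEMENT_TYPES = {"transposon", "integron", "insertion_sequence", "other"}


def parse_mobile_element(value):
    """Split '<type>:<id>' and check the type against the preliminary list."""
    element_type, sep, element_id = value.partition(":")
    # Assumption: both the colon and a non-empty identifier are required.
    if not sep or not element_id:
        raise ValueError(f"expected <type>:<id>, got {value!r}")
    if element_type not in MOBILE_ELEMENT_TYPES:
        raise ValueError(f"unknown mobile element type: {element_type!r}")
    return element_type, element_id


# The example value from the description above.
print(parse_mobile_element("transposon:Tnp9"))
```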


