[Genbank-bb] GenBank Release 147.0 Now Available

genbankb at iubio.bio.indiana.edu
Tue Apr 26 22:05:24 EST 2005


Greetings GenBank Users,

  GenBank Release 147.0 is now available via ftp from the National
Center for Biotechnology Information (NCBI):

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ftp.ncbi.nih.gov   genbank     GenBank Release 147.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 147.0

  Close-of-data was 04/20/2005. Five business days were required to build
Release 147.0. Uncompressed, the Release 147.0 flatfiles require approximately
168 GB (sequence files only) or 185 GB (including the 'short directory' and
'index' files).  The ASN.1 version requires approximately 145 GB. From
the release notes:

   Release  Date       Base Pairs   Entries

   146      Feb 2005   46849831226  42734478
   147      Apr 2005   48235738567  44202133

In the nine-week period between the close dates for GenBank Releases 146.0
and 147.0, the non-WGS portion of GenBank grew by 1,386,557,002 basepairs
and by 1,467,655 sequence records. During that same period, 339,590 records
were updated. Combined, this yields an average of about 28,690 new and/or
updated records per day.
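
The daily average above can be reproduced with a short Python sketch. It
assumes the "nine-week period" means exactly 63 days, which the announcement
does not state; the record counts are the ones quoted above.

```python
# Figures quoted in the announcement for the interval between
# the close dates of Releases 146.0 and 147.0 (non-WGS records).
NEW_RECORDS = 1_467_655
UPDATED_RECORDS = 339_590
DAYS = 9 * 7  # assumption: "nine weeks" taken as exactly 63 days


def records_per_day(new, updated, days):
    """Average number of new and/or updated records per day."""
    return (new + updated) / days


# The Entries column of the release table gives the same growth figure.
assert 44_202_133 - 42_734_478 == NEW_RECORDS

total = NEW_RECORDS + UPDATED_RECORDS
average = records_per_day(NEW_RECORDS, UPDATED_RECORDS, DAYS)
print(f"{total:,} records over {DAYS} days = {average:,.0f} per day")
# prints: 1,807,245 records over 63 days = 28,686 per day
```

Under that 63-day assumption the result is 28,686, which rounds to the
"about 28,690" quoted above.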

  Between releases 146.0 and 147.0, the WGS component of GenBank grew by
1,446,908,262 basepairs and by 573,964 sequence records.

  For additional release information, see the README files in either of
the directories mentioned above, and the release notes (gbrel.txt) in
the genbank directory. Sections 1.3 and 1.4 of the release notes
(Changes in Release 147.0 and Upcoming Changes) have been appended
below.

  **NOTE** Problems were encountered generating the gbacc.idx and
gbkey.idx 'index' files that accompany GenBank Releases. See Section
1.3.3 for further details.

  Release 147.0 data, and subsequent updates, are available now via
NCBI's Entrez and Blast services.

  As a general guideline, we suggest first transferring the GenBank release
notes (gbrel.txt) whenever a release is being obtained. Check to make sure
that the date and release number in the header of the release notes are
current (e.g., April 15 2005, 147.0). If they are not, interrupt the
remaining transfers and then request assistance from the NCBI Service Desk.

  A comprehensive check of the headers of all release files after your
transfers are complete is also suggested. Here's how one might go about
this on a Unix platform with csh/tcsh:

	set files = `ls gb*.*`
	foreach i ($files)
		head -10 $i | grep Release
	end

Or, if the files are compressed, perhaps use this as the loop body:

	gzcat $i | head -10 | grep Release
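
For sites without csh/tcsh, the same header check might be sketched in
Python. This is only an illustration: the function names are hypothetical,
and it assumes compressed files carry a .gz suffix and that the expected
header text is "Release 147.0".

```python
import gzip
from itertools import islice
from pathlib import Path


def release_lines(path, max_lines=10):
    """Return the lines mentioning 'Release' among the first max_lines of a file."""
    path = Path(path)
    # Assumption: compressed release files end in .gz.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        return [
            line.rstrip("\n")
            for line in islice(handle, max_lines)
            if "Release" in line
        ]


def stale_files(directory, expected="Release 147.0"):
    """List gb*.* files whose first lines do not mention the expected release."""
    return [
        p.name
        for p in sorted(Path(directory).glob("gb*.*"))
        if not any(expected in line for line in release_lines(p))
    ]
```

Calling stale_files(".") in the download directory would then return the
names of any files whose headers need a closer look.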

  If you encounter problems while ftp'ing or uncompressing Release
147.0, please send email outlining your difficulties to:

	info at ncbi.nlm.nih.gov 

Mark Cavanaugh, Vladimir Alekseyev, Aleksey Vysokolov, Michael Kimelman
GenBank
NCBI/NLM/NIH


1.3 Important Changes in Release 147.0

1.3.1 ENV Division introduced with April 2005 release

  A new division for sequences obtained via environmental sampling methods
has been introduced with GenBank Release 147.0. This new division segregates
128,571 sequences for which the source organism is unknown, or can only be
inferred by sequence comparison. The new sequence files are:

	gbenv1.seq
	gbenv2.seq

Records in the ENV division have these characteristics:

  1. ENV division code on the LOCUS line
  2. /environmental_sample qualifier for the source feature

And, as of Release 148.0 in June, these records will also have an ENV keyword.
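
A parser could test for the two characteristics above roughly as follows.
This Python sketch is an illustration only: the function names are
hypothetical, the sample record is invented, and it assumes the division
code is the token immediately before the date on the LOCUS line.

```python
def locus_division(locus_line):
    """Return the division code from a LOCUS line.

    Assumption: the code is the second-to-last whitespace-separated
    token, immediately before the date.
    """
    tokens = locus_line.split()
    if len(tokens) < 3 or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    return tokens[-2]


def is_env_record(record_text):
    """True if a record has both the ENV division code and the qualifier.

    Raises ValueError if the first line is not a LOCUS line.
    """
    lines = record_text.splitlines()
    has_env_division = bool(lines) and locus_division(lines[0]) == "ENV"
    has_qualifier = any(
        line.strip().startswith("/environmental_sample") for line in lines
    )
    return has_env_division and has_qualifier


# Hypothetical record fragment, invented for illustration only.
sample = """LOCUS       XX000001                 500 bp    DNA     linear   ENV 15-APR-2005
FEATURES             Location/Qualifiers
     source          1..500
                     /environmental_sample
"""
print(is_env_record(sample))  # True
```

Once the ENV keyword arrives with Release 148.0, a third check on the
KEYWORDS line could be added in the same style.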

  Note that sequences from WGS projects that involve environmental sampling
will *not* be distributed via this new division. All WGS projects continue
to be distributed using project-specific data files at the NCBI FTP site:

	ftp://ftp.ncbi.nih.gov/ncbi-asn1/wgs
	ftp://ftp.ncbi.nih.gov/genbank/wgs

1.3.2 Removal of MEDLINE linetype as of April 2005 release

The PUBMED linetype was introduced in December of 1997, as a means of
linking references in sequence records to the PubMed biomedical literature
database, based on a PubMed ID (PMID).

Since then, we have been displaying both the PMID and its predecessor
(Medline Unique ID / MUID) for all references. For example:

LOCUS       ECOGUABA                3531 bp    DNA     linear   BCT 09-FEB-2005
DEFINITION  Escherichia coli guaBA operon operon, complete sequence.
ACCESSION   M10101 M10102
VERSION     M10101.1  GI:146274
....
REFERENCE   1  (bases 1768 to 3531)
  AUTHORS   Tiedeman,A.A., Smith,J.M. and Zalkin,H.
  TITLE     Nucleotide sequence of the guaA gene encoding GMP synthetase of
            Escherichia coli K12
  JOURNAL   J. Biol. Chem. 260 (15), 8676-8679 (1985)
  MEDLINE   85261223
   PUBMED   3894345

Subsequent to 1997, PMID article identifiers subsumed MUIDs. Some background
information about that evolution can be found at:

  http://www.nlm.nih.gov/pubs/techbull/mj01/mj01_medline_ui.html

Starting with GenBank Release 147.0, the older MEDLINE linetype is displayed
in GenBank sequence records only for very rare articles that lack a PMID
identifier.

For the vast majority of articles, this means that only the PUBMED identifier
is now presented.
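
Software that reads reference blocks may therefore need to treat the MEDLINE
line as optional. A minimal Python sketch, with hypothetical function names
and assuming each identifier line consists of exactly the linetype followed
by one value, as in the example above:

```python
def reference_ids(reference_lines):
    """Collect MEDLINE and PUBMED identifiers from one REFERENCE block."""
    ids = {"MEDLINE": None, "PUBMED": None}
    for line in reference_lines:
        parts = line.split()
        # Assumption: identifier lines are exactly "<LINETYPE> <value>".
        if len(parts) == 2 and parts[0] in ids:
            ids[parts[0]] = parts[1]
    return ids


def literature_id(reference_lines):
    """Prefer the PubMed ID; fall back to the MUID for rare articles without one."""
    ids = reference_ids(reference_lines)
    if ids["PUBMED"]:
        return ("PUBMED", ids["PUBMED"])
    if ids["MEDLINE"]:
        return ("MEDLINE", ids["MEDLINE"])
    return None


old_style = ["  MEDLINE   85261223", "   PUBMED   3894345"]
new_style = ["   PUBMED   3894345"]
print(literature_id(old_style) == literature_id(new_style))  # True
```

Both the pre-147.0 and the 147.0 forms of the example reference resolve to
the same PubMed ID, so code written this way is unaffected by the change.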

1.3.3 Problems generating accession number and keyword indexes

  Software problems during Release 147.0 prevented the generation of
the gbacc.idx and gbkey.idx 'index' files which normally accompany
GenBank releases.

  A version of gbacc.idx was built manually. However, the first field
contains just an accession number rather than Accession.Version.
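
Code that reads the first field of gbacc.idx may want to accept both forms.
The Python sketch below handles only the field value itself and makes no
assumption about the rest of the index layout; the function name is
hypothetical.

```python
def split_accession(field):
    """Split 'M10101.1' into ('M10101', 1); a bare 'M10101' gives ('M10101', None)."""
    accession, dot, version = field.strip().partition(".")
    # int() raises ValueError if something other than digits follows the dot.
    return accession, (int(version) if dot else None)


print(split_accession("M10101.1"))  # ('M10101', 1)
print(split_accession("M10101"))    # ('M10101', None)
```

A version of None then signals that the entry came from an index built
without version numbers, as with the manually built 147.0 file.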

  The gbkey.idx index could not be created without substantial
additional delays in release processing, so it is completely absent
from 147.0.

  Our apologies for any inconvenience that this may cause.

1.3.4 Organizational changes

  The total number of sequence data files increased by 25 with this release:

  - the ENV division is now comprised of   2 files (+2)
  - the EST division is now comprised of 388 files (+11)
  - the GSS division is now comprised of 142 files (+4)
  - the PLN division is now comprised of  16 files (+1)
  - the ROD division is now comprised of  18 files (+2)
  - the STS division is now comprised of  14 files (+5)

1.3.5 GSS File Header Problem

  GSS sequences at GenBank are maintained in two different systems, depending
on their origin, and the dumps from those systems occur in parallel. Because
the second dump (for example) has no prior knowledge of exactly how many GSS
files will be dumped from the first, it does not know how to number its own
output files.

  There is thus a discrepancy between the filenames and file headers for
twenty-five GSS flatfiles in Release 147.0. Consider gbgss117.seq:

GBGSS1.SEQ           Genetic Sequence Data Bank
                           April 15 2005

                NCBI-GenBank Flat File Release 147.0

                           GSS Sequences (Part 1)

   87197 loci,    64745577 bases, from    87197 reported sequences

  Here, the header gives the filename as GBGSS1.SEQ and the part number as 1,
though the file itself has been renamed gbgss117.seq based on the number of
files dumped from the other system. We will work to resolve this discrepancy
in future releases, but it is a lower priority than many other tasks.
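
Sites that validate headers against filenames could flag the affected GSS
files with a check along these lines. This Python sketch uses hypothetical
function names and assumes uncompressed files whose header begins with the
filename in upper case, as in the gbgss117.seq example above.

```python
from pathlib import Path


def header_filename(first_line):
    """Return the filename token at the start of a flatfile header, lowercased."""
    tokens = first_line.split()
    return tokens[0].lower() if tokens else ""


def header_matches(path, first_line):
    """True if the name in the header's first line agrees with the actual filename."""
    return header_filename(first_line) == Path(path).name.lower()


header = "GBGSS1.SEQ           Genetic Sequence Data Bank"
print(header_matches("gbgss117.seq", header))  # False
```

For Release 147.0 a False result is expected for the twenty-five renamed
GSS files, so such a check should warn rather than reject them.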

1.4 Upcoming Changes

  No substantive changes are anticipated for GenBank Release 148.0.
