GenBank Release 119.0 Available

Mark Cavanaugh cavanaug at lagrange.nlm.nih.gov
Sat Aug 19 03:37:54 EST 2000


Greetings GenBank Users,

  GenBank Release 119.0 is now available via ftp from the National Center
for Biotechnology Information:

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
  ncbi.nlm.nih.gov   genbank     GenBank Release 119.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 119.0

  Uncompressed, the Release 119.0 flatfiles require roughly 33201 MB
(sequence files only) or 38000 MB (including the 'index' files).  The
ASN.1 version requires roughly 29799 MB. From the release notes:

   Release  Date       Base Pairs   Entries

   118      Jun 2000   8604221980   7077491
   119      Aug 2000   9545724824   8214339

  Close-of-data was 08/10/2000. Eight days were required to prepare this
release. In the seven-week period between close-of-data for GenBank 118.0
and GenBank 119.0, GenBank grew by 0.942 billion basepairs and 1,136,848
sequence records.
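
  The growth figures above follow directly from the totals in the table.
As a quick check, a minimal Python sketch (the RELEASES dictionary and the
growth() helper are illustrative names, not part of any NCBI tool):

```python
# Totals copied from the release-notes table above.
RELEASES = {
    118: {"base_pairs": 8604221980, "entries": 7077491},
    119: {"base_pairs": 9545724824, "entries": 8214339},
}

def growth(old, new):
    """Return (added base pairs, added entries) between two releases."""
    a, b = RELEASES[old], RELEASES[new]
    return b["base_pairs"] - a["base_pairs"], b["entries"] - a["entries"]

added_bp, added_entries = growth(118, 119)
print(f"{added_bp / 1e9:.3f} billion basepairs, {added_entries:,} records")
```

  This prints "0.942 billion basepairs, 1,136,848 records", matching the
figures quoted above.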

  For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 119.0 and Upcoming Changes) have been appended below.

  PLEASE NOTE: We have experienced some problems building the new-format
'index' files (gbaut.idx, etc) which are described in Section 1.3.7 of the
release notes. Rather than delay the installation of GenBank 119.0, we
have chosen to make the release available without the indexes. The index
files should be available by Tuesday, August 22, 2000. A followup newsgroup
posting will be made at that time.

  PLEASE NOTE: There is no longer a one-to-one correspondence between the
ASN.1 files and the GenBank flatfiles which comprise GenBank releases.
NCBI uses ASN.1 as the data representation for its sequence records. A
new database now stores the flatfile "image" of every ASN.1 record as
it is added or updated. Consequently, release generation now involves dumps
from two separate systems (ASN.1 vs flatfile). Because of this, and because
the size of an ASN.1 record can differ greatly from the size of its flatfile,
the contents of a given ASN.1 file and a given flatfile cannot be expected
to match. This is why there are
26 HTG ASN.1 files and only 23 HTG flatfiles, for example.

  Release 119.0 data are currently available via NCBI's Entrez and Blast
servers, and the 'query' email server.

  New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 119.0 close-of-data, should be
available by 10:00am EDT, August 19. Please note that the new CUs will be
smaller than previous versions you might have obtained after Release 118.0 was
posted.

  If you encounter problems while ftp'ing or uncompressing Release 119.0,
please send email outlining your difficulties to info at ncbi.nlm.nih.gov .

Mark Cavanaugh, Vladimir Alekseyev, Anton Butanaev
GenBank
NCBI/NLM/NIH


1.3 Important Changes in Release 119.0

1.3.1 Organizational changes

  Due to database growth, the BCT division is now being split into three pieces.

  Due to database growth, the EST division is now being split into eighty pieces.

  Due to database growth, the GSS division is now being split into twenty-four
pieces.

  Due to database growth, the PRI division is now being split into seven pieces.

1.3.2 STS division newly split into multiple files

  The STS GenBank division is now split into multiple files, since its total size
exceeds 300MB. The files for the STS division are now: gbsts1.seq and gbsts2.seq .

1.3.3 Number of HTG data files has been reduced

  Quality score data for the sequences generated by the Human Genome Project
are in the process of being incorporated into GenBank records. This data
is stored within our ASN.1 representation, but does not appear in the
GenBank-format flatfiles of the HTG division. In the past, the HTG division
was split into multiple pieces on the basis of the size of the ASN.1
representation. As quality scores enlarged the ASN.1 data, the average size
of each flatfile piece decreased, resulting in an unnecessarily large number
of gbhtg*.seq flatfiles.

  For GenBank 119.0, the parameters for splitting the HTG division were 
adjusted to yield an average flatfile size of about 250 MB. This has reduced
the number of HTG files to twenty-three.
  
1.3.4 Change in compression method for GenBank Releases and Updates

  As announced via the GenBank newsgroup on June 15, NCBI now uses the
gzip compression utility instead of the Unix 'compress' utility for all
GenBank products, starting with GenBank Release 119.0 and the first GenBank
Updates made available after the release was placed on NCBI's ftp site.

  This timing differs slightly from what was described in the GenBank 118.0
release notes: the first of the GenBank Updates to use gzip will be nc0819
rather than nc0815.

  Comparisons of gzip to compress for simple sequence data (e.g., EST,
GSS, STS) showed that gzip yields an additional 50% reduction in the size
of a compressed file. Given that ESTs and GSS sequences comprise a huge portion of the
GenBank data NCBI distributes, switching to gzip saves a great deal
of disk space, and reduces the amount of bandwidth utilized by those who
ftp GenBank products.

  As a result of the switch to gzip, file naming conventions have changed.
The suffix of compressed GenBank data files was previously ".Z". With the
switch, the suffix becomes ".gz". For example:

	gbbct1.seq.Z -> gbbct1.seq.gz
	gbcu.flat.Z  -> gbcu.flat.gz
	nc0610.aso.Z -> nc0610.aso.gz
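
  For scripts that need to follow the renaming, the mapping above can be
sketched in Python as below. The helper names gz_name() and open_text() are
hypothetical; gzip.open() is from the Python standard library.

```python
import gzip

def gz_name(old_name):
    """Map a compress-era filename (".Z") to its gzip-era name (".gz")."""
    if not old_name.endswith(".Z"):
        raise ValueError(f"not a .Z filename: {old_name!r}")
    return old_name[:-2] + ".gz"

def open_text(path):
    """Open a gzip-compressed GenBank file for line-by-line text reading."""
    return gzip.open(path, "rt", encoding="ascii", errors="replace")

for name in ("gbbct1.seq.Z", "gbcu.flat.Z", "nc0610.aso.Z"):
    print(name, "->", gz_name(name))
```

  A downloaded file could then be read with, for example,
"with open_text('gbbct1.seq.gz') as fh:", assuming that file exists locally.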

  If you are unsure about the availability of gzip for your platform, please
contact your system administrator. If you find that the utility is not
installed, one possible place for obtaining gzip is:

	http://www.gnu.org/software/gzip/gzip.html

  Any questions or concerns that you have about this change should be directed
to NCBI's Service Desk:

	info at ncbi.nlm.nih.gov

1.3.5 File-naming convention for ASN.1 data files has changed

  The filename convention for the ASN.1 data files used to create GenBank
flatfile releases has been changed. These ASN.1 files can be found at the NCBI
ftp site:

    ftp://ncbi.nlm.nih.gov/ncbi-asn1/

  The naming convention has been changed so that the ASN.1 filenames and
the GenBank flatfile names match more closely:

    gbDIV-CODE.aso.gz

For example:

    gbbct1.aso.gz
    gbbct2.aso.gz
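
  A minimal Python sketch of how such names might be taken apart follows.
The regular expression is an assumption inferred from the examples in these
notes (lowercase division letters followed by an optional piece number), and
parse_name() is a hypothetical helper. Note that matching names do not imply
matching contents, since the ASN.1 files and flatfiles no longer correspond
one-to-one.

```python
import re

# Assumed pattern: "gb" + division letters + optional piece number,
# then ".seq" (flatfile) or ".aso" (ASN.1), then ".gz".
_NAME = re.compile(r"^gb([a-z]+)(\d*)\.(seq|aso)\.gz$")

def parse_name(filename):
    """Return (division, piece number or None, kind) for a release file."""
    m = _NAME.match(filename)
    if m is None:
        raise ValueError(f"unrecognized release filename: {filename!r}")
    division, piece, kind = m.groups()
    return division, (int(piece) if piece else None), kind

print(parse_name("gbbct1.aso.gz"))
print(parse_name("gbbct2.aso.gz"))
```

  Names outside this assumed pattern, such as the cumulative update files,
are rejected with a ValueError rather than guessed at.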

1.3.6 New PUBMED linetype for REFERENCEs

  The PUBMED linetype is now legal for the GenBank flatfile format as of
GenBank Release 119.0, and will soon begin to appear in GenBank Update files.
The PUBMED line will be located in the REFERENCE block of GenBank flatfiles:

LOCUS       AF245949      558 bp    RNA             VRL       30-APR-2000
DEFINITION  Hepatitis C virus isolate P11 clone A41 polyprotein precursor,
            E1/E2 region, gene, partial cds.
ACCESSION   AF245949
VERSION     AF245949.1  GI:7670856
....
REFERENCE   1  (bases 1 to 558)
  AUTHORS   Farci,P., Shimoda,A., Coiana,A., Diaz,G., Peddis,G.,
            Melpolder,J.C., Strazzera,A., Chien,D.Y., Munoz,S.J.,
            Balestrieri,A., Purcell,R.H. and Alter,H.J.
  TITLE     The outcome of acute hepatitis C predicted by the evolution of the
            viral quasispecies
  JOURNAL   Science 288 (5464), 339-344 (2000)
  MEDLINE   20230065
   PUBMED   10764648

  The PUBMED identifier is the record identifier for article abstracts
in the PubMed database:

       http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed

  Abstracts in PubMed that do not fall within Medline's scope will have only
a PUBMED identifier. Similarly, abstracts that *are* in Medline's scope but
which have not yet been assigned Medline UIs will have only a PUBMED identifier.
If an abstract is present in both the PubMed and Medline databases, both Medline UI
and PubMed ID will be provided.
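
  Parsers that read REFERENCE blocks will need to accept the new linetype.
A minimal Python sketch is given below. It assumes only what the sample
record above shows: top-level keywords start in column 1, and the MEDLINE
and PUBMED lines are indented sub-keywords followed by an identifier. The
function name reference_ids() is hypothetical.

```python
def reference_ids(record_text):
    """Collect MEDLINE and PUBMED identifiers for each REFERENCE block."""
    refs = []
    current = None
    for line in record_text.splitlines():
        if line.startswith("REFERENCE"):
            current = {}
            refs.append(current)
        elif line[:1] not in (" ", ""):
            current = None  # any other top-level keyword closes the block
        elif current is not None:
            fields = line.split()
            if len(fields) >= 2 and fields[0] in ("MEDLINE", "PUBMED"):
                current[fields[0]] = fields[1]
    return refs

# Lines taken from the sample record above.
SAMPLE = (
    "REFERENCE   1  (bases 1 to 558)\n"
    "  JOURNAL   Science 288 (5464), 339-344 (2000)\n"
    "  MEDLINE   20230065\n"
    "   PUBMED   10764648\n"
)
print(reference_ids(SAMPLE))
```

  Because either identifier may be absent, each reference is returned as a
dictionary that holds only the keys actually present in the record.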

1.3.7 New format for GenBank Index files

  Starting with GenBank Release 119.0, the format of the "index" files for
releases has been changed from a tabular, fixed-column format to a TAB-delimited,
line-oriented format. The header information at the start of index files is no
longer provided.

  In the past, index file entries consisted of a line containing the indexed
term, followed by a table containing LOCUS/DIVISION/ACCESSION triplets.
For example:

GBKEY.IDX          Genetic Sequence Data Bank
                         15 April 2000

               NCBI-GenBank Flat File Release 117.0

                       Keyword Phrase Index

 6215002 loci,  7376080723 bases, from 6215002 reported sequences
....
ZONA PELLUCIDA 2 GLYCOPROTEIN
             AB000929   ROD AB000929 CATFZP2G   MAM D45067 CJZPG2     PRI Y10767
             DOGCZP2G   MAM D45069 MRZPG2     PRI Y10690 PIGPZP2G   MAM D45064

  Notice that the "fixed" format was broken by the presence of
eight-character accession numbers. Rather than define a new fixed format
that would itself break at some point in the future, we have adopted a
line-oriented format, at the expense of slightly larger files. In the new
index files, the above example will look like this:

ZONA PELLUCIDA 2 GLYCOPROTEIN
	   AB000929   ROD AB000929
	   CATFZP2G   MAM D45067
	   CJZPG2     PRI Y10767
	   DOGCZP2G   MAM D45069
	   MRZPG2     PRI Y10690
	   PIGPZP2G   MAM D45064

A series of LOCUS/DIVISION/ACCESSION triplets, TAB-delimited (and with
a leading TAB), one per line, will follow each indexed value.
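
  A reader for the new format might be sketched in Python as follows. This
is based only on the description above (a term line, then TAB-led,
TAB-delimited triplet lines); the exact layout of the released files is an
assumption, and parse_index() is a hypothetical name.

```python
def parse_index(lines):
    """Map each indexed term to its (locus, division, accession) triplets."""
    index = {}
    term = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("\t"):
            if term is None:
                raise ValueError("triplet line before any indexed term")
            # Unpacking fails loudly if a line does not hold three fields.
            locus, division, accession = (
                field.strip() for field in line[1:].split("\t")
            )
            index[term].append((locus, division, accession))
        else:
            term = line.strip()
            index.setdefault(term, [])
    return index

SAMPLE = [
    "ZONA PELLUCIDA 2 GLYCOPROTEIN\n",
    "\tAB000929\tROD\tAB000929\n",
    "\tCATFZP2G\tMAM\tD45067\n",
]
print(parse_index(SAMPLE))
```

  Since the function iterates over lines, an open file object can be passed
in directly, so a large index file need not be read into memory at once.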

  Additional details about the changes to the index files were provided
via the GenBank newsgroup (bionet.molbio.genbank):

   http://www.bio.net/hypermail/genbankb/genbankb.200007/0000.html

1.4 Upcoming Changes

1.4.1 Order of entries in the Short Directory file will change

  The gbsdr.txt file which accompanies GenBank releases contains the
DEFINITION line and number of bases for every sequence. For historical
reasons, a specific division ordering has always been used for the
sections of this file. This order begins with:

	PRIMATE
	RODENT
	OTHER MAMMALIAN
	OTHER VERTEBRATE
	INVERTEBRATE
	....

and ends with:

	SEQUENCE TAGGED SITE
	GENOME SURVEY SEQUENCE
	HIGH THROUGHPUT GENOMIC SEQUENCING

  Starting with GenBank Release 120.0, the ordering of the sections
of this file will be determined solely by the sort-order of the
GenBank division codes. For example:

	bct1
	bct2
	...
	est1
	...
	gss1
	...
	htg1
	...
	vrl1
	vrl2
	vrt
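
  For illustration, one possible ordering can be sketched in Python. These
notes do not say whether numbered pieces sort numerically (est2 before
est10) or as plain strings (est10 before est2); the sketch assumes numeric
order, and division_key() is a hypothetical helper.

```python
import re

def division_key(code):
    """Sort key: division letters first, then the piece number (assumed)."""
    m = re.fullmatch(r"([a-z]+)(\d*)", code)
    if m is None:
        raise ValueError(f"unrecognized division code: {code!r}")
    letters, piece = m.groups()
    return letters, (int(piece) if piece else 0)

codes = ["vrt", "est10", "bct2", "est2", "vrl1", "bct1", "est1"]
print(sorted(codes, key=division_key))
```

  Under this assumption the result is bct1, bct2, est1, est2, est10, vrl1,
vrt. A plain sorted(codes) would instead place est10 before est2.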

1.4.2 The gbaut.idx file will be split into multiple pieces

  The gbaut.idx file now exceeds 3 GB in size, so it will be split into
pieces of a more manageable size starting with GenBank Release 120.0 in
October, 2000.

1.4.3 NCBI's ftp address will be changed

  At some point in the near future (but not sooner than two months from now),
NCBI's ftp address will be changed. The current address:

	ncbi.nlm.nih.gov

will become:

	ftp.ncbi.nih.gov

  Additional details about this change will be made available via these
release notes and the GenBank newsgroup (bionet.molbio.genbank) as they
become available.

1.4.4 Selenocysteine representation

  Selenocysteine residues within the protein translations of coding
region features have been represented in GenBank via the letter 'X'
and a /transl_except qualifier. At the May 1999 DDBJ/EMBL/GenBank
collaborative meeting, it was learned that IUPAC plans to adopt the
letter 'U' for selenocysteine.

  DDBJ, EMBL, and GenBank will thus use this new amino acid abbreviation
for their /translation qualifiers. Although a timetable for its appearance
has not been finalized, we are mentioning this now because the introduction
of a new residue abbreviation is a fairly fundamental change.

  Details about the use of 'U' will be made available via these release
notes and the GenBank newsgroup as they become available.

1.4.5 New REFERENCE type for on-line journals

  Agreement was reached at the May 1999 collaborative DDBJ/EMBL/GenBank
meeting that an effort should be made to accommodate references which are
published only on-line. Until specifications for such references are
available from library organizations, GenBank will present them in a manner
like this:

	REFERENCE   1  (bases 1 to 2858)
	  AUTHORS   Smith, J.
	  TITLE     Cloning and expression of a phospholipase gene
	  JOURNAL   Online Publication
	  REMARK    Online-Journal-name; Article Identifier; URL

  This format is still tentative; additional information about this new
reference type will be made available via these release notes.

