GenBank Release 118.0 Available From NCBI

Mark Cavanaugh cavanaug at
Fri Jun 23 05:02:10 EST 2000

Greetings GenBank Users,

  GenBank Release 118.0 is now available via ftp from the National Center
for Biotechnology Information:

  Ftp Site           Directory   Contents
  ----------------   ---------   ---------------------------------------
                     genbank     GenBank Release 118.0 flatfiles
                     ncbi-asn1   ASN.1 data used to create Release 118.0

  Uncompressed, the Release 118.0 flatfiles require roughly 28218 MB
(sequence files only) or 33916 MB (including the 'index' files).  The
ASN.1 version requires roughly 24152 MB. From the release notes:

   Release  Date       Base Pairs   Entries

   117      Apr 2000   7376080723   6215002
   118      Jun 2000   8604221980   7077491

  Close-of-data was 06/15/2000. Seven days were required to prepare this
release. In the seven-week period between close-of-data for GenBank 117.0
and GenBank 118.0, GenBank grew by 1.228 billion basepairs and 862,489
sequence records.
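
  The growth figures quoted above follow directly from the release table.
A minimal sketch of the arithmetic in Python (the names RELEASES and growth
are illustrative only and do not belong to any NCBI tool):

```python
# Totals copied from the release-notes table above.
RELEASES = {
    117: {"base_pairs": 7376080723, "entries": 6215002},
    118: {"base_pairs": 8604221980, "entries": 7077491},
}


def growth(old, new):
    """Return (added base pairs, added entries) between two releases."""
    return (
        RELEASES[new]["base_pairs"] - RELEASES[old]["base_pairs"],
        RELEASES[new]["entries"] - RELEASES[old]["entries"],
    )


added_bp, added_entries = growth(117, 118)
print(f"{added_bp / 1e9:.3f} billion base pairs, {added_entries:,} entries")
```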

 PLEASE NOTE: Problems were encountered once again building the author-name
index file (gbaut.idx) for GenBank 118.0 . The version now available on
our ftp site, while complete (and exceeding 4.3 GB in size!), has not been
converted to the required tabular format. Given that a new index file format
will be implemented for GenBank 119.0, we are making gbaut.idx available in
its raw form rather than attempt to fix the old, soon to be obsolete,
software which performs the conversion. Our apologies for any inconvenience
that this causes.

 (See also Section 1.4.2 of the release notes, below)

 For additional release information, see the README files in either of the
directories mentioned above, and the release notes (gbrel.txt) in the
genbank directory. Sections 1.3 and 1.4 of the release notes (Changes in
Release 118.0 and Upcoming Changes) have been appended below.

  Release 118.0 data are currently available via NCBI's Entrez and Blast
servers, and the 'query' email server.

  New GenBank cumulative update files (gbcu.flat.Z and gbcu.aso.Z), containing
only those entries new/updated since the Release 118.0 close-of-data, should be
available by 10:00am EDT, June 23. Please note that the new CUs will be
smaller than previous versions you might have obtained after Release 117.0 was
released.

  If you encounter problems while ftp'ing or uncompressing Release 118.0,
please send email outlining your difficulties to info at .

Mark Cavanaugh

1.4 Upcoming Changes

1.4.1 New PUBMED linetype for REFERENCEs

  Starting with GenBank Release 119.0 in August 2000, a new PUBMED
linetype will be legal for the REFERENCE block of GenBank flatfiles:

LOCUS       AF245949      558 bp    RNA             VRL       30-APR-2000
DEFINITION  Hepatitis C virus isolate P11 clone A41 polyprotein precursor,
            E1/E2 region, gene, partial cds.
VERSION     AF245949.1  GI:7670856
REFERENCE   1  (bases 1 to 558)
  AUTHORS   Farci,P., Shimoda,A., Coiana,A., Diaz,G., Peddis,G.,
            Melpolder,J.C., Strazzera,A., Chien,D.Y., Munoz,S.J.,
            Balestrieri,A., Purcell,R.H. and Alter,H.J.
  TITLE     The outcome of acute hepatitis C predicted by the evolution of the
            viral quasispecies
  JOURNAL   Science 288 (5464), 339-344 (2000)
  MEDLINE   20230065
   PUBMED   10764648

  The PUBMED identifier is the record identifier for article abstracts
in the PubMed database :

  Abstracts in PubMed that do not fall within Medline's scope will have only
a PUBMED identifier. Similarly, abstracts that *are* in Medline's scope but
which have not yet been assigned Medline UIs will have only a PUBMED identifier.
If an abstract is present in both the PubMed and Medline databases, both Medline UI
and PubMed ID will be provided.
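
  Software that reads REFERENCE blocks may need to tolerate records with
either identifier, both, or neither. The following Python sketch shows one
way to collect them. It assumes the usual flatfile layout in which the
keyword occupies the first 12 columns of a line and the value follows; the
function name reference_ids is hypothetical.

```python
def reference_ids(flatfile_text):
    """Collect MEDLINE and PUBMED identifiers, one dict per REFERENCE block.

    Assumes keywords sit in the first 12 columns and values follow them.
    """
    refs = []
    in_reference = False
    for line in flatfile_text.splitlines():
        keyword = line[:12].strip()
        value = line[12:].strip()
        if not keyword:
            continue  # continuation line (e.g. wrapped AUTHORS or TITLE)
        if not line.startswith(" "):
            # A top-level keyword either opens a REFERENCE block or ends one.
            in_reference = keyword == "REFERENCE"
            if in_reference:
                refs.append({})
        elif in_reference and keyword in ("MEDLINE", "PUBMED"):
            refs[-1][keyword] = value
    return refs


sample = """\
REFERENCE   1  (bases 1 to 558)
  JOURNAL   Science 288 (5464), 339-344 (2000)
  MEDLINE   20230065
   PUBMED   10764648
"""
print(reference_ids(sample))
```

  A record carrying only a PUBMED line would simply yield a dict without
the MEDLINE key, so callers should not assume that both are present.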

1.4.2 New format for GenBank Index files

  Starting with GenBank Release 119.0 in August 2000, the format of the
"index" files for releases will change from a tabular, fixed-column
format to a TAB-delimited, line-oriented format. The header information
at the start of index files will no longer be provided.

  In general, index file entries consist of a line containing the indexed
term, followed by a table containing LOCUS/DIVISION/ACCESSION triplets.
For example:

GBKEY.IDX          Genetic Sequence Data Bank
                         15 April 2000

               NCBI-GenBank Flat File Release 117.0

                       Keyword Phrase Index

 6215002 loci,  7376080723 bases, from 6215002 reported sequences
             AB000929   ROD AB000929 CATFZP2G   MAM D45067 CJZPG2     PRI Y10767
             DOGCZP2G   MAM D45069 MRZPG2     PRI Y10690 PIGPZP2G   MAM D45064

  Notice that the "fixed" format is already broken, due to the presence of
eight-character accession numbers. Rather than define a new fixed format
that will break at some point in the future, and at the expense of slightly
larger files, the new index files for the above example will look like so:

	   AB000929   ROD AB000929
	   CATFZP2G   MAM D45067
	   CJZPG2     PRI Y10767
	   DOGCZP2G   MAM D45069
	   MRZPG2     PRI Y10690
	   PIGPZP2G   MAM D45064

A series of LOCUS/DIVISION/ACCESSION triplets, TAB-delimited (and with
a leading TAB), one per line, will follow each indexed value.
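
  Based only on the description above (an indexed value on its own line,
then one TAB-led, TAB-delimited triplet per line), a reader for the new
format might look like the Python sketch below. The complete details had
not yet been published, so this is an assumption rather than a reference
parser; parse_index and the term "EXAMPLE KEYWORD" are placeholders.

```python
def parse_index(lines):
    """Map each indexed term to a list of (locus, division, accession)."""
    index = {}
    term = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("\t"):
            if term is None:
                raise ValueError("triplet line before any indexed term")
            # Drop the leading TAB, then split the three fields.
            locus, division, accession = (
                field.strip() for field in line[1:].split("\t")
            )
            index[term].append((locus, division, accession))
        else:
            term = line.strip()
            index.setdefault(term, [])
    return index


sample = [
    "EXAMPLE KEYWORD\n",  # placeholder term, not taken from the release
    "\tAB000929\tROD\tAB000929\n",
    "\tCATFZP2G\tMAM\tD45067\n",
]
print(parse_index(sample))
```

  Because nothing depends on column positions, accession numbers of any
length pass through unchanged, which is the point of the new format.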

  Complete details about the changes to the index files will be provided
via the GenBank newsgroup (bionet.molbio.genbank) in late June.

1.4.3 STS division will be split into multiple files

  The STS GenBank division (gbsts.seq) will soon be split into multiple
files, since its size exceeds 300MB. Though the split did not occur for
GenBank 118.0 (because the STS division experienced only trivial growth
since 117.0), it will very likely occur by GenBank Release 119.0
(August 2000). The resulting files for the STS division will be: gbsts1.seq
and gbsts2.seq .

1.4.4 File-naming convention for ASN.1 data files will be changed.

  Starting with GenBank Release 119.0 in August 2000, the filename
convention for the ASN.1 data files used to create GenBank flatfile
releases will be changed. These ASN.1 files can be found at the NCBI
ftp site:

The naming convention for these files is currently:


For example:


  This convention will be changed so that the ASN.1 filenames and the
GenBank flatfile names match more closely:


For example:


1.4.5 Change in compression method for GenBank Releases and Updates

  As announced via the GenBank newsgroup on June 15, NCBI will use the
gzip compression utility instead of the Unix 'compress' utility for all
GenBank products starting on August 15, 2000. The nc0815 non-cumulative
update and the GenBank cumulative update of 8/15 will be the first
products to use gzip compression. When Release 119.0 processing is complete
about a week later, the files which comprise that release will also be
compressed with gzip.

  Comparisons of gzip to compress for simplistic sequence data (e.g., EST,
GSS, STS) yielded an additional 50% reduction in the size of a compressed
file. Given that ESTs and GSS sequences comprise a huge portion of the
GenBank data NCBI distributes, switching to gzip will save a great deal
of disk space, and will reduce the amount of bandwidth utilized by those who
ftp GenBank products.

  As a result of the switch to gzip, file naming conventions will change.
The suffix of compressed GenBank data files is currently ".Z" . After the
switch, the suffix will become ".gz" . For example:

	gbbct1.seq.Z -> gbbct1.seq.gz
	gbcu.flat.Z  -> gbcu.flat.gz
	nc0610.aso.Z -> nc0610.aso.gz
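
  Scripts that fetch or unpack GenBank products by name will need the same
suffix change. A small Python sketch of the mapping follows; the helper
names gz_name and open_flatfile are hypothetical. Note that Python's gzip
module reads ".gz" files only, not the older ".Z" files, and that text
mode suits the flatfiles rather than the ASN.1 products.

```python
import gzip


def gz_name(name):
    """Translate a '.Z' product name to its '.gz' equivalent."""
    if name.endswith(".Z"):
        return name[:-2] + ".gz"
    return name  # already renamed, or not a compressed product


def open_flatfile(path):
    """Open a gzip-compressed flatfile for line-by-line text reading."""
    return gzip.open(path, "rt")


for old in ("gbbct1.seq.Z", "gbcu.flat.Z", "nc0610.aso.Z"):
    print(old, "->", gz_name(old))
```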

  If you are unsure about the availability of gzip for your platform, please
contact your system administrator. If you find that the utility is not
installed, one possible place for obtaining gzip is:

  Any questions or concerns that you have about this change should be directed
to NCBI's Service Desk:

	info at

1.4.6 Planned reduction in the number of HTG data files

  Quality score data for the sequences generated by the Human Genome Project
are in the process of being incorporated into GenBank records. This data
is stored within our ASN.1 representation, but does not appear in the
GenBank-format flatfiles of the HTG division. Since the basis for splitting
the HTG division into multiple pieces is the size of the ASN.1 representation,
this has led to a decrease in the average size of each piece (currently about
116 MB), and consequently an unnecessarily large number of gbhtg*.seq flatfiles.

  For GenBank 119.0 in August of 2000, the parameters for splitting the HTG
division will be adjusted to yield an average flatfile size of about 250 MB.
This will reduce the number of HTG files by approximately 50%.

1.4.7 Selenocysteine representation

  Selenocysteine residues within the protein translations of coding
region features have been represented in GenBank via the letter 'X'
and a /transl_except qualifier. At the May 1999 DDBJ/EMBL/GenBank
collaborative meeting, it was learned that IUPAC plans to adopt the
letter 'U' for selenocysteine.

  DDBJ, EMBL, and GenBank will thus use this new amino acid abbreviation
for their /translation qualifiers. Although a timetable for its appearance
has not been finalized, we are mentioning this now because the introduction
of a new residue abbreviation is a fairly fundamental change.

  Details about the use of 'U' will be made available via these release
notes and the GenBank newsgroup as they become available.

1.4.8 New REFERENCE type for on-line journals

  Agreement was reached at the May 1999 collaborative DDBJ/EMBL/GenBank
meeting that an effort should be made to accommodate references which are
published only on-line. Until specifications for such references are
available from library organizations, GenBank will present them in a manner
like this:

	REFERENCE   1  (bases 1 to 2858)
	  AUTHORS   Smith, J.
	  TITLE     Cloning and expression of a phospholipase gene
	  JOURNAL   Online Publication
	  REMARK    Online-Journal-name; Article Identifier; URL

  This format is still tentative; additional information about this new
reference type will be made available via these release notes.

