GenBank : New Index File Format : Release 119.0

Mark Cavanaugh cavanaug at lagrange.nlm.nih.gov
Thu Jul 27 11:47:56 EST 2000


Greetings GenBank Users,

As described in the release notes for GenBank Releases 117.0 and 118.0,
the format of the "index" files which accompany GenBank sequence data
files will change as of Release 119.0 (August, 2000). This post provides
some final details regarding these changes.

Best regards,

Mark Cavanaugh
GenBank
NCBI/NLM/NIH

PS: Don't forget that all GenBank Release and GenBank Update products will
be gzip'd rather than Unix-compressed when the Release 119.0 data files
are made available on NCBI's ftp site.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Headers will no longer be present.

As of 119.0, the GenBank index files will no longer include a header
describing their contents. Here is an example of the old header for
the gbkey.idx index file: 

GBKEY.IDX          Genetic Sequence Data Bank
                         15 April 2000

               NCBI-GenBank Flat File Release 117.0

                       Keyword Phrase Index

 6215002 loci,  7376080723 bases, from 6215002 reported sequences

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

A TAB-delimited format will be used for most of the indexes.

The fixed-column tabular-style format utilized for the gbaut.idx, gbgen.idx,
gbjou.idx, and gbkey.idx indexes will be replaced by a line-oriented,
TAB-delimited format. The indexed terms will remain on their own lines:

Indexed-Term
	LOCUS-name1	Div-code1	Accession1
	LOCUS-name2	Div-code2	Accession2
	LOCUS-name3	Div-code3	Accession3
	....

Here is an example of the new format, in which TAB characters are displayed
as ^I, and carriage-returns/newlines as $ :

(H+,K+)-ATPASE BETA-SUBUNIT$
^IRATHKATPB^IROD^IM55655$
^IMUSATP4B1^IROD^IM64685$
^IMUSATP4B2^IROD^IM64686$
^IMUSATP4B3^IROD^IM64687$
^IMUSATP4B4^IROD^IM64688$
^IDOGATPASEB^IMAM^IM76486$

When viewed with a pager such as 'less' or 'more':

(H+,K+)-ATPASE BETA-SUBUNIT
        RATHKATPB       ROD     M55655
        MUSATP4B1       ROD     M64685
        MUSATP4B2       ROD     M64686
        MUSATP4B3       ROD     M64687
        MUSATP4B4       ROD     M64688
        DOGATPASEB      MAM     M76486

Note that the index terms can be distinguished from the LOCUS/DIV/ACCESSION
lines by the fact that they do not start with a TAB character. So one can
extract just the terms via simple text processing:

	perl -ne 'print unless /^\t/' < gbkey.idx > terms.gbkey
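
For those who want the records as well as the terms, here is a minimal
sketch in Python of one way to read the term-grouped layout described above.
The function name parse_term_index and the use of 3-tuples are illustrative
assumptions, not part of any NCBI-supplied tool:

```python
def parse_term_index(lines):
    """Group (locus, division, accession) records under their indexed term.

    Assumes the layout described above: a term on its own line, followed by
    one TAB-indented, TAB-delimited record line per GenBank entry.
    """
    index = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("\t"):
            if current is None:
                raise ValueError("record line found before any indexed term")
            fields = line[1:].split("\t")
            if len(fields) != 3:
                raise ValueError("expected 3 fields, got %d" % len(fields))
            index[current].append(tuple(fields))
        else:
            # A line that does not start with a TAB is a new indexed term.
            current = line
            index.setdefault(current, [])
    return index


sample = (
    "(H+,K+)-ATPASE BETA-SUBUNIT\n"
    "\tRATHKATPB\tROD\tM55655\n"
    "\tDOGATPASEB\tMAM\tM76486\n"
)
index = parse_term_index(sample.splitlines())
```

In practice one would pass an open file handle (for example, for gbkey.idx)
instead of the in-memory sample, since iterating over a file yields lines.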

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Secondaries removed from gbacc.idx, and a TAB-delimited format utilized.

The content of the accession number index file will be limited to
just the primary accessions of the records that appear in a GenBank
release.

The format will be similar to that described above, but the indexed
term (Accession.Version) will not be on a separate line:

Accession1.Version1	Locus-name1	Div-code1	Accession1
Accession2.Version2	Locus-name2	Div-code2	Accession2
....

Here is an example of the new format, in which TAB characters are displayed
as ^I, and carriage-returns/newlines as $ :

AC000102.1^IAC000102^IPRI^IAC000102$
AC000103.1^IAC000103^IPLN^IAC000103$
AC000104.1^IF19P19^IPLN^IAC000104$
AC000105.40^IAC000105^IPRI^IAC000105$
AC000106.1^IF7G19^IPLN^IAC000106$
AC000107.1^IAC000107^IPLN^IAC000107$
AC000108.1^IAC000108^IBCT^IAC000108$
AC000109.1^IHSAC000109^IPRI^IAC000109$
AC000110.1^IHSAC000110^IPRI^IAC000110$

When viewed with a pager such as 'less' or 'more':

AC000102.1      AC000102        PRI     AC000102
AC000103.1      AC000103        PLN     AC000103
AC000104.1      F19P19  PLN     AC000104
AC000105.40     AC000105        PRI     AC000105
AC000106.1      F7G19   PLN     AC000106
AC000107.1      AC000107        PLN     AC000107
AC000108.1      AC000108        BCT     AC000108
AC000109.1      HSAC000109      PRI     AC000109
AC000110.1      HSAC000110      PRI     AC000110
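
Because each gbacc.idx line is self-contained, it can be split directly on
TAB characters. The following Python is a sketch only; the function name
parse_accession_line and the dictionary keys are hypothetical, and it assumes
the version suffix is always an integer, as in the examples above:

```python
def parse_accession_line(line):
    """Split one gbacc.idx line into its four TAB-delimited fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    acc_ver, locus, division, accession = fields
    # The first field has the form Accession.Version, e.g. AC000105.40
    _, _, version = acc_ver.partition(".")
    return {
        "accession_version": acc_ver,
        "version": int(version) if version else None,
        "locus": locus,
        "division": division,
        "accession": accession,
    }


record = parse_accession_line("AC000105.40\tAC000105\tPRI\tAC000105\n")
```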

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

A new index file for secondary accessions will be introduced.

The secondary accessions removed from gbacc.idx will appear in a
new index file called gbsec.idx.

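The format of this new index file will be identical to that
used for gbaut.idx: each secondary accession appears on its own line,
followed by one TAB-indented line for each record that cites it (since
multiple GenBank records can have the same secondary).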

Here is an example of the new format, in which TAB characters are displayed
as ^I, and carriage-returns/newlines as $ :

Z70297$
^IAF022186^IPLN^IAF022186$
Z71175$
^IAF029714^IBCT^IAF029714$
Z71256$
^IYSCD9476^IPLN^IU28372$
^IYSCD9481^IPLN^IU28373$
^IYSCD9740^IPLN^IU28374$
^ISCD9509^IPLN^IU32274$
^IYSCD9798^IPLN^IU32517$
^ISCD9461^IPLN^IU33007$
^ISCD8035^IPLN^IU33050$
^ISCD9717^IPLN^IU33057$
^ISCU43834^IPLN^IU43834$
^IYSCD9954^IPLN^IU51030$
^IYSCD9819^IPLN^IU51031$
^IYSCD9651^IPLN^IU51032$
Z71336$
^IAF072686^IVRT^IAF072686$
Z95845$
^IAF005248^IBCT^IAF005248$

When viewed with a pager such as 'less' or 'more':

Z70297
        AF022186        PLN     AF022186
Z71175
        AF029714        BCT     AF029714
Z71256
        YSCD9476        PLN     U28372
        YSCD9481        PLN     U28373
        YSCD9740        PLN     U28374
        SCD9509 PLN     U32274
        YSCD9798        PLN     U32517
        SCD9461 PLN     U33007
        SCD8035 PLN     U33050
        SCD9717 PLN     U33057
        SCU43834        PLN     U43834
        YSCD9954        PLN     U51030
        YSCD9819        PLN     U51031
        YSCD9651        PLN     U51032
Z71336
        AF072686        VRT     AF072686
Z95845
        AF005248        BCT     AF005248
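
One likely use of gbsec.idx is resolving a secondary accession to the primary
accession(s) of the records that now carry it. The Python below is a rough
sketch under the format described above; the function name
secondary_to_primaries is an illustrative assumption:

```python
def secondary_to_primaries(lines):
    """Map each secondary accession to the primary accessions listed under it."""
    mapping = {}
    secondary = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if not line.startswith("\t"):
            # Secondary accessions sit on their own, unindented line.
            secondary = line
            mapping.setdefault(secondary, [])
        elif secondary is not None:
            # Record lines are: TAB, LOCUS name, TAB, division, TAB, accession.
            locus, division, accession = line[1:].split("\t")
            mapping[secondary].append(accession)
    return mapping


sample = (
    "Z70297\n"
    "\tAF022186\tPLN\tAF022186\n"
    "Z71256\n"
    "\tYSCD9476\tPLN\tU28372\n"
    "\tYSCD9481\tPLN\tU28373\n"
)
mapping = secondary_to_primaries(sample.splitlines())
```

A secondary such as Z71256 maps to several primaries, which is why the
term-on-its-own-line layout is used for this file.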



