Greetings GenBank Users,
As described in the release notes for GenBank Releases 117.0 and 118.0,
the format of the "index" files which accompany GenBank sequence data
files will change as of Release 119.0 (August 2000). This post provides
some final details regarding these changes.
Best regards,
Mark Cavanaugh
GenBank
NCBI/NLM/NIH
PS: Don't forget that all GenBank Release and GenBank Update products will
be gzip'd rather than Unix-compressed when the Release 119.0 data files
are made available on NCBI's ftp site.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Headers will no longer be present.
As of 119.0, the GenBank index files will no longer include a header
describing their contents. Here is an example of the old header for
the gbkey.idx index file:
GBKEY.IDX Genetic Sequence Data Bank
15 April 2000
NCBI-GenBank Flat File Release 117.0
Keyword Phrase Index
6215002 loci, 7376080723 bases, from 6215002 reported sequences
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
A TAB-delimited format will be used for most of the indexes.
The fixed-column tabular-style format utilized for the gbaut.idx, gbgen.idx,
gbjou.idx, and gbkey.idx indexes will be replaced by a line-oriented,
TAB-delimited format. The indexed terms will remain on their own lines:
Indexed-Term
        LOCUS-name1 Div-code1 Accession1
        LOCUS-name2 Div-code2 Accession2
        LOCUS-name3 Div-code3 Accession3
        ....
Here is an example of the new format, in which TAB characters are displayed
as ^I, and carriage-returns/newlines as $ :
(H+,K+)-ATPASE BETA-SUBUNIT$
^IRATHKATPB^IROD^IM55655$
^IMUSATP4B1^IROD^IM64685$
^IMUSATP4B2^IROD^IM64686$
^IMUSATP4B3^IROD^IM64687$
^IMUSATP4B4^IROD^IM64688$
^IDOGATPASEB^IMAM^IM76486$
When viewed by a file browser such as 'less' or 'more' :
(H+,K+)-ATPASE BETA-SUBUNIT
        RATHKATPB ROD M55655
        MUSATP4B1 ROD M64685
        MUSATP4B2 ROD M64686
        MUSATP4B3 ROD M64687
        MUSATP4B4 ROD M64688
        DOGATPASEB MAM M76486
Note that the index terms can be distinguished from LOCUS/DIV/ACCESSION
by the fact that they do not start with a TAB character. So one can
extract just the terms via simple text-processing:
perl -ne 'print unless /^\t/' < gbkey.idx > terms.gbkey
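The same leading-TAB rule can also be used to group each record under its term rather than discarding the records. The following Python sketch is illustrative only and is not part of the GenBank distribution: the function name parse_term_index is hypothetical, and the code assumes exactly the layout shown above (a term line with no leading TAB, followed by TAB-led lines carrying three fields).

```python
def parse_term_index(lines):
    """Yield (term, records) pairs from a term-grouped index such as gbkey.idx.

    records is a list of (locus, division, accession) tuples.
    Assumption: term lines never start with a TAB, record lines always do.
    """
    term, records = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines (an assumption, not stated in the post)
        if line.startswith("\t"):
            if term is None:
                raise ValueError("record line before any index term: %r" % line)
            fields = line[1:].split("\t")
            if len(fields) != 3:
                raise ValueError("expected 3 TAB-delimited fields: %r" % line)
            records.append(tuple(fields))
        else:
            if term is not None:
                yield term, records  # emit the previous term's group
            term, records = line, []
    if term is not None:
        yield term, records  # emit the final group


# Small in-memory sample taken from the example above.
sample = [
    "(H+,K+)-ATPASE BETA-SUBUNIT\n",
    "\tRATHKATPB\tROD\tM55655\n",
    "\tDOGATPASEB\tMAM\tM76486\n",
]
index = dict(parse_term_index(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly; for a gzip'd copy, Python's gzip.open(path, "rt") should serve the same purpose.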
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Secondaries removed from gbacc.idx, and a TAB-delimited format utilized.
The content of the accession number index file will be limited to
just the primary accessions of the records that appear in a GenBank
release.
The format will be similar to that described above, but the indexed
term (Accession.Version) will not be on a separate line:
Accession1.Version1 Locus-name1 Div-code1 Accession1
Accession2.Version2 Locus-name2 Div-code2 Accession2
....
Here is an example of the new format, in which TAB characters are displayed
as ^I, and carriage-returns/newlines as $ :
AC000102.1^IAC000102^IPRI^IAC000102$
AC000103.1^IAC000103^IPLN^IAC000103$
AC000104.1^IF19P19^IPLN^IAC000104$
AC000105.40^IAC000105^IPRI^IAC000105$
AC000106.1^IF7G19^IPLN^IAC000106$
AC000107.1^IAC000107^IPLN^IAC000107$
AC000108.1^IAC000108^IBCT^IAC000108$
AC000109.1^IHSAC000109^IPRI^IAC000109$
AC000110.1^IHSAC000110^IPRI^IAC000110$
When viewed by a file browser such as 'less' or 'more' :
AC000102.1 AC000102 PRI AC000102
AC000103.1 AC000103 PLN AC000103
AC000104.1 F19P19 PLN AC000104
AC000105.40 AC000105 PRI AC000105
AC000106.1 F7G19 PLN AC000106
AC000107.1 AC000107 PLN AC000107
AC000108.1 AC000108 BCT AC000108
AC000109.1 HSAC000109 PRI AC000109
AC000110.1 HSAC000110 PRI AC000110
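Since every line of the new gbacc.idx is a complete four-field record, it can be loaded with a plain split on TAB. The Python sketch below is only an illustration under that assumption; the names AccRecord and parse_acc_index are hypothetical and do not come from NCBI.

```python
from collections import namedtuple

# Field order follows the layout given above:
# Accession.Version, Locus-name, Div-code, Accession
AccRecord = namedtuple("AccRecord", "accession_version locus division accession")


def parse_acc_index(lines):
    """Return a dict mapping Accession.Version to an AccRecord."""
    by_accver = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines (an assumption, not stated in the post)
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError("expected 4 TAB-delimited fields: %r" % line)
        record = AccRecord(*fields)
        by_accver[record.accession_version] = record
    return by_accver


# Small in-memory sample taken from the example above.
sample = [
    "AC000104.1\tF19P19\tPLN\tAC000104\n",
    "AC000105.40\tAC000105\tPRI\tAC000105\n",
]
acc_index = parse_acc_index(sample)

# The version number is whatever follows the last '.' in the first field.
version = int(acc_index["AC000105.40"].accession_version.rsplit(".", 1)[1])
```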
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
A new index file for secondary accessions will be introduced.
The secondary accessions removed from gbacc.idx will appear in a
new index file called gbsec.idx .
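The format of this new index file will be identical to that
used for gbaut.idx: each secondary accession appears on its own line,
followed by one TAB-indented line per record that cites it (multiple
GenBank records can have the same secondary).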
Here is an example of the new format, in which TAB characters are displayed
as ^I, and carriage-returns/newlines as $ :
Z70297$
^IAF022186^IPLN^IAF022186$
Z71175$
^IAF029714^IBCT^IAF029714$
Z71256$
^IYSCD9476^IPLN^IU28372$
^IYSCD9481^IPLN^IU28373$
^IYSCD9740^IPLN^IU28374$
^ISCD9509^IPLN^IU32274$
^IYSCD9798^IPLN^IU32517$
^ISCD9461^IPLN^IU33007$
^ISCD8035^IPLN^IU33050$
^ISCD9717^IPLN^IU33057$
^ISCU43834^IPLN^IU43834$
^IYSCD9954^IPLN^IU51030$
^IYSCD9819^IPLN^IU51031$
^IYSCD9651^IPLN^IU51032$
Z71336$
^IAF072686^IVRT^IAF072686$
Z95845$
^IAF005248^IBCT^IAF005248$
When viewed by a file browser such as 'less' or 'more' :
Z70297
        AF022186 PLN AF022186
Z71175
        AF029714 BCT AF029714
Z71256
        YSCD9476 PLN U28372
        YSCD9481 PLN U28373
        YSCD9740 PLN U28374
        SCD9509 PLN U32274
        YSCD9798 PLN U32517
        SCD9461 PLN U33007
        SCD8035 PLN U33050
        SCD9717 PLN U33057
        SCU43834 PLN U43834
        YSCD9954 PLN U51030
        YSCD9819 PLN U51031
        YSCD9651 PLN U51032
Z71336
        AF072686 VRT AF072686
Z95845
        AF005248 BCT AF005248
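A common use of gbsec.idx would presumably be to resolve a secondary accession to the record or records that now carry it. The Python sketch below is one hedged way to do that; the function name secondary_to_primaries is hypothetical, and the code assumes that the third field of each TAB-led line is the primary accession of the record listing that secondary, as the examples above suggest.

```python
from collections import defaultdict


def secondary_to_primaries(lines):
    """Map each secondary accession to the primary accessions listed under it."""
    mapping = defaultdict(list)
    secondary = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines (an assumption, not stated in the post)
        if line.startswith("\t"):
            if secondary is None:
                raise ValueError("record line before any secondary: %r" % line)
            fields = line[1:].split("\t")
            if len(fields) != 3:
                raise ValueError("expected 3 TAB-delimited fields: %r" % line)
            locus, division, primary = fields
            mapping[secondary].append(primary)
        else:
            secondary = line  # a line without a leading TAB starts a new group
    return dict(mapping)


# Small in-memory sample taken from the example above.
sample = [
    "Z70297\n",
    "\tAF022186\tPLN\tAF022186\n",
    "Z71256\n",
    "\tYSCD9476\tPLN\tU28372\n",
    "\tYSCD9481\tPLN\tU28373\n",
]
primaries = secondary_to_primaries(sample)
```

A list is kept for every secondary, even when it holds one entry, because a single secondary such as Z71256 can map to many records.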