NCBI Search CD-ROM File Format

Warren Gish gish at host.nlm.nih.gov
Fri Nov 13 18:53:59 EST 1992


The following announcement of the NCBI Search CD-ROM file format for sequence
data is directed primarily to software developers.  Users who wish to be
notified when the Search CD-ROM is available for ordering should subscribe
to "NCBI News".  A free subscription to this newsletter can be obtained by
sending your name and postal address to:  info at ncbi.nlm.nih.gov

Sample Search CD-ROM data files and reference source code are available now, via
anonymous ftp, on ncbi.nlm.nih.gov (130.14.20.1) beneath the /pub/searchcd
directory.  Additional recommended software and documentation are posted on the
same ftp server beneath the /toolbox/ncbi_tools_rel2.0 and /pub/gish
directories.  While the software beneath /pub/searchcd has been used locally to
produce the sample Search files, it is posted only as a developer reference.
The code does not compile and link into an executable application and it should
not be considered a stable basis for development of application software.

A limited quantity of a prototype Search CD-ROM will be available within 8
weeks.  Developers interested in obtaining a copy should send their name and
postal address before Friday, November 20 to:  searchcd at ncbi.nlm.nih.gov or
contact me by phone.

Sincerely,
Warren Gish
NCBI/NLM
Voice:  301-496-2475
FAX:  301-480-9241



           NCBI Search CD-ROM File Format -- Sample Distribution


INTRODUCTION

The NCBI Search CD-ROM will be a vehicle for distributing molecular sequence
data in a format that is efficient for sequence similarity searches using
methods such as FASTA and BLAST.  The Search format further supports multiple
definitions linked to each sequence (as may be desired in a nonredundant
collection) and facilitates the application of incremental updates such as
might be obtained via a network.  The NCBI intends to migrate much of its
software over to using the "Search CD" format.

The set of sequences distributed on the Entrez:Sequences CD-ROM will also be
issued every two months on a Search CD-ROM, in the format outlined below and
detailed in the accompanying C language source files.  As
space permits, sequences from the NCBI's GenInfo(R) Integrated Database, which
will be available within the next several months, will also be included on the
Search CD-ROM in a separate directory.

Prof. William Pearson (Univ. of Virginia) has generously offered to produce PC
and Mac versions of FASTA to search files on the Search CD-ROM; these programs
will be distributed on the Search CD-ROM itself, with software support handled
by Dr. Pearson.  As they become available, UNIX, Mac, and PC versions of BLAST
software from the NCBI will be posted "as is" for anonymous ftp access on
ncbi.nlm.nih.gov.

Commercial providers of software and services which support the NCBI Search
CD-ROM are encouraged to notify the NCBI of their product availability.  Their
names and addresses will be placed on file and will be made available to
potential customers who inquire about support.  Please send all such
announcements to searchcd at ncbi.nlm.nih.gov



SAMPLE FILES

There are 5 subdirectories in this sample distribution:

  src -- C language source files describing the Search CD-ROM file formats.
         Additional supporting source code is posted on ncbi.nlm.nih.gov
         beneath the /toolbox/ncbi_tools_rel2.0 and /pub/gish directories.
  search -- a mock Search CD-ROM with some sample files
  search/entrez -- sample files in Search CD-ROM format using a very small
                   subset of the Entrez:Sequences data
  search/giid -- sample List file describing the division of GenInfo(R)
                 Integrated database data into separate files
  fasta -- the FASTA-format data used to generate the Search CD-ROM files
           in the "search/entrez" directory.  FASTA files will not be
           distributed on the Search CD-ROM.

The most important single file for understanding the internal structure of
files on the Search CD-ROM is src/headers.h.  It is recommended that
src/headers.h be browsed prior to studying the remainder of this document.
Individual C structure elements are specified in the same order in which the
data are stored within the files.  All "unsigned long" integers are stored in a
4-byte big-endian format.  Byte alignment is not observed--all elements are
close packed.
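
As an illustration only (this is not taken from src/headers.h), a 4-byte
big-endian value can be decoded portably one byte at a time, which sidesteps
both host byte order and alignment.  The function names are made up for this
example, and uint32_t from <stdint.h> is assumed as a stand-in for the 4-byte
"unsigned long".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one 4-byte big-endian unsigned integer starting at p.
 * Working byte by byte makes the result independent of the host's
 * byte order and of alignment (file elements are close packed).
 * The name scd_get_u32 is hypothetical, not part of the NCBI sources. */
static uint32_t scd_get_u32(const unsigned char *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}

/* Read the idx-th value (0-based) from a packed run of 4-byte values. */
static uint32_t scd_get_u32_at(const unsigned char *buf, size_t idx)
{
    return scd_get_u32(buf + 4 * idx);
}
```

Copying the bytes of a record straight into a C struct would be unsafe on
little-endian hosts and wherever the compiler pads structure members, so a
reader along these lines would normally decode each element explicitly.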



FIVE DISTINCT FILE TYPES

For sequence similarity searching, only two types of files are of primary
concern:  Sequence and Definition.  Each record in a Sequence file includes the
record length, a sequence identifier, an encoded and/or packed sequence, and
the file offset into the Definitions file for a linked list of Definition
records.  Each Definition record contains zero or more sequence identifiers and
associated human-readable definitions, plus a (possibly NULL) file offset to
the next Definition record in the linked list.
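
The linked-list traversal described above might look something like the
following.  This is only a sketch:  the actual record layouts are given in
src/headers.h and are not reproduced here, so the layout used below (a 4-byte
"next" offset at the very start of each Definition record, with 0 standing
for NULL) is an assumption made purely for illustration, as are the function
names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 4-byte big-endian unsigned integer (hypothetical helper). */
static uint32_t get_u32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Count the Definition records on the chain that starts at def_off.
 *
 * ASSUMED layout (not from headers.h): each record begins with the
 * 4-byte offset of the next record, and an offset of 0 means NULL.
 *
 * file/len  : in-memory image of a Definition file
 * max_hops  : upper bound on chain length, to stop on corrupt cycles
 * Returns the record count, or -1 if an offset falls outside the
 * image or the chain exceeds max_hops. */
static long scd_count_defs(const unsigned char *file, size_t len,
                           uint32_t def_off, long max_hops)
{
    long n = 0;

    assert(file != NULL);
    while (def_off != 0) {
        if (len < 4 || def_off > len - 4 || n >= max_hops)
            return -1;
        n++;
        def_off = get_u32(file + def_off);
    }
    return n;
}
```

Bounding the number of hops is a defensive choice for this sketch; a damaged
file whose offsets form a loop would otherwise never terminate.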

A third file type, Table, contains file offsets into matching Sequence and
Definition files.  One potential use for a Table file is in the assignment of
sequence records to individual processors in a multiprocessing or multithreaded
implementation of a search program.  A Table file may also store offsets to
only a subset of the records in the database.  A more obscure use of a Table
file would be to provide offsets to alternate or ancillary definitions from
those pointed to within the Sequence file itself.
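
To illustrate the multiprocessing use, the sketch below divides the entries
of a Table file into contiguous, nearly equal slices, one per worker.  The
ScdSlice type and the function name are hypothetical, and the split is by
record count only; a real implementation might prefer to balance by total
sequence length.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [first, last) of Table entries for one worker.
 * Hypothetical type, not part of the NCBI sources. */
typedef struct {
    size_t first;
    size_t last;
} ScdSlice;

/* Give worker number `worker` (0-based) its share of nrecords Table
 * entries.  When the division is uneven, the first
 * (nrecords % nworkers) workers each receive one extra entry. */
static ScdSlice scd_slice_for_worker(size_t nrecords, size_t nworkers,
                                     size_t worker)
{
    ScdSlice s;
    size_t base, extra;

    assert(nworkers > 0 && worker < nworkers);
    base  = nrecords / nworkers;
    extra = nrecords % nworkers;
    s.first = worker * base + (worker < extra ? worker : extra);
    s.last  = s.first + base + (worker < extra ? 1 : 0);
    return s;
}
```

Each worker would then read the Sequence and Definition offsets for entries
first through last-1 from the Table file and process only those records.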

The fourth file type, Index, is used to look up Sequence and Definition record
offsets, given a sequence identifier.  An Index file can contain offsets into
either amino acid or nucleic acid sequence files.

The fifth file type, List, is used to describe an entire collection of Search
files, including their base filenames, residue type (aa or nt), and division
(e.g., primates, plants, invertebrates).  This is the only file type whose
content is entirely human readable text.  List files may be a focal point for
customizing the installation of the other Search files.



FILE SIZE RESTRICTIONS

The Search CD-ROM will be produced in ISO 9660 format, so on some computing
platforms each of the file names will have a ";1" suffix tacked on.  The ISO
9660 format restricts files to being 64 MB or less in size, while MS-DOS
further restricts files to being 32 MB or less.  Consequently, it may be
necessary to split some data sets or divisions into multiple sets of Sequence,
Definition, and Table files in order to stay under the 32 MB limit.  Software
written to support Search files for other operating systems and media should
not be so limited.  Index files can support nearly 2 million records
(identifiers) and still reside below the 32 MB limit, so it is not anticipated
that Index files will need to be split for a considerable length of time.
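
Software that opens Search files by name may need to cope with the ";1"
suffix mentioned above.  One possible approach, sketched here with a made-up
function name, is to strip any trailing ISO 9660 version suffix (a semicolon
followed only by digits) before comparing or displaying file names.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Remove a trailing ISO 9660 version suffix, i.e. ';' followed by one
 * or more decimal digits, from name (modified in place).
 * Returns 1 if a suffix was removed, 0 if name was left unchanged.
 * scd_strip_version is a hypothetical name for this example. */
static int scd_strip_version(char *name)
{
    char *semi;
    const char *p;

    assert(name != NULL);
    semi = strrchr(name, ';');
    if (semi == NULL || semi[1] == '\0')
        return 0;                       /* no ';' or nothing after it */
    for (p = semi + 1; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return 0;                   /* not a pure version number */
    }
    *semi = '\0';
    return 1;
}
```

Applied to "entrez.fls;1" this leaves "entrez.fls"; a name without the suffix
is returned untouched.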



FILE NAMING CONVENTIONS

Sequence, Definition, and Table files that work together will share a common
base filename.  Similarly, List and Index files that work together will share a
common base filename.  The filename extensions of Sequence, Definition, and
Table files depend on the type of sequence (aa or nt) to which the files
refer:  *.[an]sq, *.[an]df, and *.[an]tb.  List and Index files can refer to both
types of files.  The filename extension for a List file is .fls.  The filename
extension for an Index file is .idx.
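
The extension scheme above can be captured in a few lines.  The following is
a sketch only; the enum, the function name, and the base names used with it
are invented for this example and do not come from the NCBI sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tags for the three residue-specific file types. */
typedef enum { SCD_SEQUENCE, SCD_DEFINITION, SCD_TABLE } ScdFileKind;

/* Build "<base>.<r>sq", "<base>.<r>df" or "<base>.<r>tb" in out,
 * where <r> is 'a' (amino acid) or 'n' (nucleotide).
 * Returns 0 on success, -1 if residue is invalid or out is too small. */
static int scd_filename(char *out, size_t outsz, const char *base,
                        char residue, ScdFileKind kind)
{
    static const char *const suffix[] = { "sq", "df", "tb" };
    size_t blen;

    assert(out != NULL && base != NULL);
    if (residue != 'a' && residue != 'n')
        return -1;
    blen = strlen(base);
    if (blen + 5 > outsz)          /* '.', residue, 2 letters, NUL */
        return -1;
    memcpy(out, base, blen);
    out[blen]     = '.';
    out[blen + 1] = residue;
    out[blen + 2] = suffix[kind][0];
    out[blen + 3] = suffix[kind][1];
    out[blen + 4] = '\0';
    return 0;
}
```

For a made-up base name "pri1" and residue 'n', the three calls would yield
pri1.nsq, pri1.ndf, and pri1.ntb.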

Sequence, Definition, Table, and Index files each begin with a header structure
described in headers.h.  The four header types begin with a structure of common
or standard elements (ScdStdHdr).  Following the standard header elements and
the file type-specific elements which complete each header, there will be zero
or more records in a format specific to the particular file type.



SEQUENCE IDENTIFIERS

Sequence identifiers are unsigned integers.  The identifiers are drawn
from the same name space used for sequence identifiers on release 1.0
of the Entrez:Sequences CD-ROM.  At present, these are "import identifiers" or
"giim" identifiers, but will ultimately be "GenInfo Integrated Database" or
"giid" identifiers.  By using the same identifiers as the Entrez:Sequences
CD-ROM, it is possible to use the identifiers obtained by a similarity
search to look up records directly on the Entrez:Sequences CD-ROM.



REDUNDANCY OF SEQUENCE INFORMATION

When practical, Search CD-ROM database files will be "nonredundant"--100%
identical sequences will be merged into a single sequence-definition record
pair.  For instance, when two or more sequences are absolutely identical at the
sequence level, only one of their identifiers will be stored in the Sequence
record, but all of their identifiers will be available in the corresponding
Definition record.  The 'def_off' element in the Sequence record points to the
head of the relevant linked list of Definition record(s) in the Definition
file.



DIVISIONAL CLUSTERING OF SEQUENCES

Protein sequences are likely to be distributed in a single "All" division,
which may need to be split into multiple sets of Sequence-Definition-Table
files in order to avoid breaking the aforementioned 32 MB file size barrier.

Nucleotide sequences will be distributed in the familiar GenBank(R) divisions,
except Organelle, which will disappear, plus a new EST division.  The List
format file giid/giid.fls provides a better outline of the planned divisions
than does the entrez/entrez.fls file.  For instance, in the
sample Entrez files, NLM Backbone sequences appear in a separate division which
will not be present in future releases.  Again, it may be necessary to split
divisions into multiple sets of Search files to avoid the file size barrier.
"DV" records in .fls files indicate the particular division to which each set
of files belongs.



DEFINITIONS & ACCESSION NUMBERS

As stored in a Definition record, Definition strings consist of a 4-byte length
field followed by a NON-null-terminated, printable ASCII string of the
specified length.  In Search CD-ROM files, the binary encoded 'seqid' elements
will NOT be repeated in printable ASCII form within the human readable
definition strings.  The definition strings may, however, contain identifiers
from other name spaces.  Definition strings in the sample files begin with an
acronym representing the database from which the definition was obtained,
followed by a colon (:), the accession number in the specified database, and
finally the definition.  The database acronyms used are:  GB=GenBank(R),
PIR=the NBRF PIR(R), SP=SWISS-PROT.  Other acronyms that
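
Because Definition strings are length-prefixed and not null-terminated, they
must be copied before being handed to the usual C string routines.  A minimal
sketch follows, assuming the length field is stored big-endian like the other
4-byte integers; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Decode a 4-byte big-endian unsigned integer (hypothetical helper). */
static uint32_t get_u32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Copy the length-prefixed string at p into a freshly allocated,
 * NUL-terminated C string.  avail is the number of readable bytes at
 * p.  On success *consumed is set to 4 + string length and the caller
 * must free() the result.  Returns NULL if the data are truncated or
 * memory cannot be allocated. */
static char *scd_read_defstr(const unsigned char *p, size_t avail,
                             size_t *consumed)
{
    uint32_t n;
    char *s;

    assert(p != NULL && consumed != NULL);
    if (avail < 4)
        return NULL;
    n = get_u32(p);
    if (n > avail - 4)
        return NULL;
    s = (char *)malloc((size_t)n + 1);
    if (s == NULL)
        return NULL;
    memcpy(s, p + 4, n);
    s[n] = '\0';
    *consumed = 4 + (size_t)n;
    return s;
}

/* Length of the leading database acronym (the text before the first
 * ':'), or 0 if the string contains no colon. */
static size_t scd_db_acronym_len(const char *def)
{
    const char *colon = strchr(def, ':');
    return colon != NULL ? (size_t)(colon - def) : 0;
}
```

The acronym helper relies only on the colon convention described above for
the sample files; it makes no assumption about how the accession number is
separated from the rest of the definition.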


