Improvements to BLAST databases

Herve Recipon recipon at
Tue Apr 4 11:07:58 EST 1995

As part of NCBI's ongoing efforts to provide BLAST users with access to
the most up-to-date, comprehensive and useful databases, the following
changes are planned for BLAST databases:

1)  Closer synchronization between BLAST and other NCBI databases.

2)  Separation of EST data from other non-redundant nucleotide sequences.

3)  A new non-redundant protein sequence database.

4)  Discontinued support for seldom-used or unmaintained databases.

We expect to implement these changes in July 1995.  More details about
each proposed change are provided below.  If you have any questions or
concerns, please direct them to blast-help at

Closer synchronization between BLAST and other databases:

     BLAST databases will be built from the same source data that are
     now used for Entrez, the Retrieve e-mail server and the daily
     updates.  This means that sequences identified in a BLAST search
     will always be accessible from NCBI search services.

     As part of the project to improve synchrony among BLAST and other
     forms of the sequence databases, every sequence will have a
     unique identifier called the "gi" number.  The gi number of an
     entry changes with each update to the sequence data, something not
     necessarily true for the accession number or locus name.  This
     will allow Entrez or Retrieve users to be certain that they are
     retrieving exactly the same revision of the sequence identified
     by BLAST.  Additionally, a gi number allows easy automated
     retrieval of sequences from database interfaces.  Retrieval by
     accession number or locus name will, of course, continue to be
     supported.
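
     The difference between a gi number and an accession number can
     be sketched in a few lines of Python.  This is only an
     illustration under stated assumptions: the SequenceStore class,
     its method names, the integer counter used to issue gi numbers
     and the accession "X00001" are all hypothetical and do not
     describe NCBI's actual software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    gi: int          # new value for every change to the sequence data
    accession: str   # may stay the same across revisions
    sequence: str


class SequenceStore:
    """Toy store (hypothetical) keyed by gi, with an accession index."""

    def __init__(self):
        self._by_gi = {}
        self._latest_gi = {}   # accession -> gi of the newest revision
        self._next_gi = 1      # assumption: gi numbers are simple counters

    def update_sequence(self, accession, sequence):
        """Record new sequence data and return the freshly issued gi."""
        rev = Revision(self._next_gi, accession, sequence)
        self._next_gi += 1
        self._by_gi[rev.gi] = rev
        self._latest_gi[accession] = rev.gi
        return rev.gi

    def by_gi(self, gi):
        """Return exactly the revision that was assigned this gi."""
        return self._by_gi[gi]

    def by_accession(self, accession):
        """Return the newest revision filed under this accession."""
        return self._by_gi[self._latest_gi[accession]]


store = SequenceStore()
gi_old = store.update_sequence("X00001", "ACGT")
gi_new = store.update_sequence("X00001", "ACGTT")  # sequence was revised
```

     In this sketch a lookup by gi_old still returns the original
     revision, while a lookup by accession returns the revised one.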

     The non-redundant BLAST nucleotide databases contain all of the
     data submitted to the international sequence database
     collaborators: GenBank, EMBL and DDBJ.  Since data are quickly
     exchanged among the three, there is no need to search them
     individually.  Therefore the option of selecting an individual
     database will be removed.  Users may be assured of searching all
     publicly available sequences, regardless of the database of
     origin.

Separation of EST data:

     In order to give users more control over their BLAST searches, the
     EST division will be split off from the other GenBank divisions.
     This separation is necessitated by the phenomenal growth in the
     EST division, which will increase by about 4000 to 6000
     sequences/week until the summer of 1996.  Partitioning the
     non-redundant database will ensure that non-EST matches are not
     masked by the tremendous number of EST sequences.  Conversely, it
     will be straightforward to search the EST division and be assured
     of only EST hits.  While this change will require that some users
     modify their search strategy, we believe that the ability to
     better specify the contents of the database will make BLAST
     searches much more productive for most users.  The reconfigured
     databases will retain the names "nr" and "dbest" so as not to
     break existing scripts.  The "new" nr will also differ from the
     current non-redundant nucleotide database in having a common
     origin with Entrez and other NCBI source databases.
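
     The partitioning described above amounts to routing each
     nucleotide record by its division.  A minimal sketch, assuming
     each record is a dictionary with hypothetical "id" and
     "division" fields (the real build process is not described in
     this announcement):

```python
def partition_by_division(records):
    """Split records into (nr, dbest) lists based on the division field."""
    nr, dbest = [], []
    for rec in records:
        if rec["division"] == "EST":
            dbest.append(rec)   # EST entries go only to dbest
        else:
            nr.append(rec)      # everything else stays in nr
    return nr, dbest


# Toy input; the ids are made up for illustration.
records = [
    {"id": "a", "division": "PRI"},
    {"id": "b", "division": "EST"},
    {"id": "c", "division": "ROD"},
]
nr, dbest = partition_by_division(records)
```

     With this toy input, "a" and "c" land in nr and "b" lands in
     dbest, so a search against either database sees only one kind
     of sequence.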

New non-redundant protein sequence database:

     The protein sequences now available in Entrez will be searchable
     in a non-redundant database called "nr".  This database will be
     composed of pdb; swiss-prot; pir sequences not found in pdb or
     swiss-prot; prf not covered in pdb, swiss-prot or pir; and all
     conceptual translations from GenBank sequences not in any of the
     other databases.  As with the non-redundant nucleotide database,
     the nr protein database will be derived from the same source as
     Entrez and other NCBI databases.
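
     The composition rule above reads as a merge in priority order.
     The sketch below is one way to express it, under two loudly
     stated assumptions: two entries count as "the same" only when
     their sequence strings are identical, and the same rule is
     applied to every source, so a swiss-prot sequence already seen
     in pdb is also skipped (the announcement does not spell this
     out).  The function name, the source key "genbank-translations"
     and the ids are hypothetical.

```python
# Sources in the order listed in the announcement.
PRIORITY = ["pdb", "swiss-prot", "pir", "prf", "genbank-translations"]


def build_nr(sources):
    """Merge sources in priority order, keeping each sequence once.

    `sources` maps a source name to a list of (seq_id, sequence) pairs.
    Returns a list of (source_name, seq_id, sequence) tuples.
    """
    seen = set()
    merged = []
    for name in PRIORITY:
        for seq_id, seq in sources.get(name, []):
            if seq in seen:
                continue          # already supplied by an earlier source
            seen.add(seq)
            merged.append((name, seq_id, seq))
    return merged


# Toy data; ids and sequences are made up for illustration.
sources = {
    "pdb": [("p1", "MKT")],
    "swiss-prot": [("s1", "MKV")],
    "pir": [("r1", "MKT"), ("r2", "MAA")],
    "prf": [("f1", "MAA"), ("f2", "MGG")],
    "genbank-translations": [("g1", "MKV"), ("g2", "MLL")],
}
nr_protein = build_nr(sources)
kept_ids = [seq_id for _, seq_id, _ in nr_protein]
```

     Here "r1", "f1" and "g1" are dropped because an earlier source
     already supplied the same sequence.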

     A second new protein sequence database containing only sequences
     from swiss-prot and pdb will also be available.  This database,
     called "spdb", will allow users to restrict searches to these two
     highly annotated sources.

Discontinued support for databases:

     The Kabat, EPD, and TFD databases are either infrequently used or
     not regularly updated.  Therefore BLAST access to these databases
     will be discontinued.

--blast-help at
