Improvements to BLAST databases
recipon at ncbi.nlm.nih.gov
Tue Apr 4 11:07:58 EST 1995
As part of NCBI's ongoing efforts to provide BLAST users with access to
the most up-to-date, comprehensive and useful databases, the following
changes are planned for BLAST databases:
1) Closer synchronization between BLAST and other NCBI databases.
2) Separation of EST data from other non-redundant nucleotide sequences.
3) A new non-redundant protein sequence database.
4) Discontinued support for seldom-used or unmaintained databases.
We expect to implement these changes in July 1995. More details about
each proposed change are provided below. If you have any questions or
concerns, please direct them to blast-help at ncbi.nlm.nih.gov
Closer synchronization between BLAST and other databases:
BLAST databases will be built from the same source data that are
now used for Entrez, the Retrieve e-mail server and the daily
updates. This means that sequences identified in a BLAST search
will always be accessible from NCBI search services.
As part of the project to improve synchrony among BLAST and other
forms of the sequence databases, every sequence will have a
unique identifier called the "gi" number. The gi number of an
entry changes with each update to the sequence data, something not
necessarily true for the accession number or locus name. This
will allow Entrez or Retrieve users to be certain that they are
retrieving exactly the same revision of the sequence identified
by BLAST. Additionally, a gi number allows easy automated
retrieval of sequences from database interfaces. Retrieval by
accession number or locus name will, of course, continue to be
The non-redundant BLAST nucleotide databases contain all of the
data submitted to the international sequence database
collaborators: GenBank, EMBL and DDBJ. Since data are quickly
exchanged among the three, there is no need to search them
individually. Therefore, the option of selecting an individual
database will be removed. Users may be assured of searching all
publicly available sequences, regardless of the database of origin.
Separation of EST data:
In order to give users more control over their BLAST searches, the
EST division will be split off from the other GenBank divisions.
This separation is necessitated by the phenomenal growth in the
EST division, which will increase by about 4000 to 6000
sequences/week until the summer of 1996. Partitioning the
non-redundant database will ensure that non-EST matches are not
masked by the tremendous number of EST sequences. Conversely, it
will be straightforward to search the EST division and be assured
of only EST hits. While this change will require that some users
modify their search strategy, we believe that the ability to
better specify the contents of the database will make BLAST
searches much more productive for most users. The reconfigured
databases will retain the names "nr" and "dbest" so as not to
break existing scripts. The "new" nr will also differ from the
current non-redundant nucleotide database in having a common
origin with Entrez and other NCBI source databases.
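For readers who maintain local scripts, the split described above can be pictured as a simple partition on the GenBank division of each entry. The following is only a minimal sketch under stated assumptions: the record layout (a dict with "gi" and "division" keys), the function name, and the sample data are hypothetical and do not reflect how NCBI actually builds the databases.

```python
def partition_by_division(records):
    """Split records into the two databases named in the announcement.

    Assumption: each record is a dict with a "division" key, and EST
    entries are marked with the division code "EST".
    """
    nr, dbest = [], []
    for rec in records:
        if rec["division"] == "EST":
            dbest.append(rec)   # EST entries go to "dbest" only
        else:
            nr.append(rec)      # every other division stays in "nr"
    return {"nr": nr, "dbest": dbest}


# Made-up sample records, for illustration only.
sample = [
    {"gi": 1, "division": "EST"},
    {"gi": 2, "division": "PRI"},
    {"gi": 3, "division": "EST"},
]
databases = partition_by_division(sample)
```

With this partition, a search against "nr" can never return an EST entry, and a search against "dbest" returns nothing else.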
New non-redundant protein sequence database:
The protein sequences now available in Entrez will be searchable
in a non-redundant database called "nr". This database will be
composed of pdb; swiss-prot; pir sequences not found in pdb or
swiss-prot; prf not covered in pdb, swiss-prot or pir; and all
conceptual translations from GenBank sequences not in any of the
other databases. As with the non-redundant nucleotide database,
the nr protein database will be derived from the same source as
Entrez and other NCBI databases.
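The layered composition described above amounts to a priority merge: a sequence is taken from the first source in the list that contains it and skipped in all later sources. The sketch below illustrates that idea only. It assumes that "not found in" means an exact match of the sequence string, which the announcement does not specify; the record layout, the source labels, and the function name are likewise hypothetical.

```python
# Priority order taken from the announcement; the labels are illustrative.
SOURCE_PRIORITY = ["pdb", "swiss-prot", "pir", "prf", "genbank-translation"]


def build_nonredundant(records):
    """Merge (source, identifier, sequence) tuples into one entry per sequence.

    Assumption: two entries are redundant when their sequences are
    identical (case-insensitive). The entry from the highest-priority
    source is the one that is kept.
    """
    rank = {name: i for i, name in enumerate(SOURCE_PRIORITY)}
    kept = {}
    # sorted() is stable, so entries from the same source keep input order.
    for source, identifier, sequence in sorted(records, key=lambda r: rank[r[0]]):
        key = sequence.upper()
        if key not in kept:
            kept[key] = (source, identifier)
    return kept


# Made-up sample data, for illustration only.
sample = [
    ("genbank-translation", "g1", "MKV"),
    ("pir", "p1", "MKV"),
    ("pdb", "d1", "MKV"),
    ("prf", "f1", "ACD"),
    ("swiss-prot", "s1", "ACD"),
]
nr_protein = build_nonredundant(sample)
```

In this sample the sequence "MKV" is kept once, credited to pdb, and "ACD" is kept once, credited to swiss-prot.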
A second new protein sequence database containing only sequences
from swiss-prot and pdb will also be available. This database,
called "spdb", will allow users to restrict searches to these two
highly annotated sources.
Discontinued support for databases:
The Kabat, EPD, and TFD databases are either infrequently used or
not regularly updated. Therefore, BLAST access to these databases
will be discontinued.
--blast-help at ncbi.nlm.nih.gov