Changes to the BLAST Databases
madden at corin.nlm.nih.gov
Fri Mar 8 14:19:44 EST 1996
March 8, 1996
This announcement describes a reorganization of the databases available for
BLAST searches at the National Center for Biotechnology Information (NCBI).
The same sequence data will be available for searching but will be organized
for more efficient searching and will be better synchronized with the Entrez
The major differences will be the elimination of EST and STS sequences from
the 'nr' (non-redundant database) and the introduction of a database ('month')
containing only the sequences added over the past 30 days. Another change
is a new definition line for protein sequences.
WWW Blast and E-mail Blast users will switch to the new set of databases
beginning March 11, 1996. Most users search 'nr'; because the database name
stays the same, the change should be minimal for them, except that EST and
STS sequences will no longer be searched.
For users of Network Blast, a new client (Blast2) is being introduced that will
not only search the new set of databases, but also provide a better interface
for post-processing search results. Blast2 represents the future direction
of the Blast service, and users of the existing Blast software, known as the
'Experimental' Blast service, are encouraged to upgrade to Blast2. However,
the existing 'Experimental' Blast clients will be able to operate with the
new databases (see Appendix 3 for technical details). Blast2 clients are
available now for FTP and users of the 'Experimental' Blast clients are able
to use the new databases now. Beginning March 11, 1996, the old databases
will no longer be available and both the Experimental and Blast2 clients will
use the new databases.
These changes are described further below in the following topics:
* New databases
* Sequence identifiers
* The Blast2 service
* A new Entrez-based e-mail server
* Databases on the FTP site
Comments about these changes are welcome; please send them to
blast-help at ncbi.nlm.nih.gov. For information about other NCBI services,
send e-mail to info at ncbi.nlm.nih.gov.
Presently both the old and the new databases are available. The
old databases will be available until March 11, 1996, at which time only the
new databases will be available. The new databases are now searchable with
the Network version of Blast (see Appendix 3).
New nucleotide databases:

  nr      Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no ESTs or STSs)
  est     Non-redundant Database of GenBank+EMBL+DDBJ EST Division
  sts     Non-redundant Database of GenBank+EMBL+DDBJ STS Division
  pdb     PDB nucleotide sequences
  vector  Vector subset of GenBank
  mito    Database of mitochondrial sequences, Rel. 1.0, July 1995
  kabat   Kabat Sequences of Nucleic Acid of Immunological Interest
  epd     Eukaryotic Promoter Database
  alu     Select Alu Repeats from REPBASE
  month   All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the
          last 30 days

New protein databases:

  nr         Non-redundant GenBank CDS translations+PDB+SwissProt+PIR
  pdb        PDB protein sequences
  spdb       Non-redundant SwissProt+PDB sequences
  kabat      Kabat Sequences of Proteins of Immunological Interest
  alu        Translations of Select Alu Repeats from REPBASE
  month      All new or revised GenBank CDS translations+PDB+SwissProt+PIR
             sequences released in the last 30 days
  swissprot  SwissProt sequences

Sequence identifiers for the new databases:
The one-line descriptions for GenBank conceptual translations will
change. The present descriptions describe the conceptual translation of a CDS
in terms of the GenBank flatfile, but do not reliably point to a specific
CDS if the order or number of CDS features changes. An example is:
"gp|U04987|SIU04987_4 env gene product [Simian immunodef...";
"SIU04987_4" indicates that this protein is the fourth CDS on the entry with
the accession U04987. Changes to the GenBank entry can change the order and
number of CDS features.
Therefore, in order to identify the specific protein sequence, NCBI is now
assigning a stable identifier, called a 'gi', to all sequences. A 'gi' is
a unique integer that changes when the sequence changes. It does not change,
however, if only the features or references of an entry are updated.
The new format for protein sequences will contain the identifier 'gi'
followed by the 'gi number':
"gi|451623 (U04987) env [Simian immunodeficiency..."
Although the accession number of the translated nucleotide sequence will
appear in the header line (U04987 in the example above), retrieval by
'gi number' is the only reliable method to locate the correct translated
DNA sequence. The new e-mail retriever (see below) or Entrez may be used
to retrieve sequences identified by "gi".
An exhaustive list of sequence identifiers used in these new databases is
provided in Appendix 1. Additional examples of definition lines are
provided in Appendix 2.
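
As an illustration of the new protein definition line, here is a minimal Python sketch that splits such a line into its gi number, the accession in parentheses, and the remaining description. The function and class names (parse_gi_defline, GiDefline) are hypothetical and not part of any NCBI software, and the pattern is inferred only from the single example shown above, so other lines may need a looser pattern.

```python
import re
from typing import NamedTuple, Optional


class GiDefline(NamedTuple):
    gi: int                   # the 'gi number'
    accession: Optional[str]  # accession of the translated nucleotide entry, if present
    description: str          # remaining free text


# Assumed layout: "gi|<digits>", optionally "(<accession>)", then free text.
_GI_DEFLINE = re.compile(r'^gi\|(\d+)\s*(?:\(([^)]+)\))?\s*(.*)$')


def parse_gi_defline(line: str) -> GiDefline:
    """Parse a new-style 'gi' definition line; raise ValueError otherwise."""
    match = _GI_DEFLINE.match(line.strip().strip('"'))
    if match is None:
        raise ValueError(f"not a gi-style definition line: {line!r}")
    gi, accession, description = match.groups()
    return GiDefline(int(gi), accession, description)


record = parse_gi_defline('gi|451623 (U04987) env [Simian immunodeficiency...')
print(record.gi, record.accession)
```

Old-style lines such as the "gp|U04987|SIU04987_4" example are rejected with a ValueError, which makes it easy to spot output that still uses the previous format.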
The Blast2 service:
Blast2 is the newest version of the BLAST client software and represents
the foundation for NCBI's future development of the BLAST service.
The Blast2 service permits BLAST searches with a number of
different clients for different platforms, available on the NCBI FTP site.
These clients can be obtained by FTP'ing to ncbi.nlm.nih.gov (login as anonymous
and cd to blast/network/blast2). In contrast to the present BLAST service (designated
"experimental"), these clients communicate with the BLAST server through
a structured interface, allowing BLAST to interface better with other programs,
e.g., post-processing programs. The Blast2 service already uses the new databases.
Although Blast2 is expected to eventually replace the 'experimental' Blast
clients, NCBI will continue to support the 'experimental' Blast client for
the near future.
New Entrez-based e-mail retrieve server ("QUERY"):
QUERY uses the Entrez Query Engine to obtain data. Entrez can retrieve
data by domain (i.e., nucleotide or protein) rather than by source database.
QUERY can retrieve entries by "gi" (see above) and is synchronized with the
new BLAST databases. To receive documentation about this service,
send an email to "query at ncbi.nlm.nih.gov". The body of the message
should consist of the word "help" (without quotes).
Databases on the FTP site
All the databases listed above are available as FASTA files from the
NCBI FTP site (ncbi.nlm.nih.gov). These FASTA files are not necessary to
perform BLAST searches using the BLAST clients discussed here. They are
only needed if one wishes to run the actual BLAST search engines in-house,
rather than sending BLAST queries to the NCBI. To obtain these files, FTP
to ncbi.nlm.nih.gov, login as anonymous and cd to "blast/db". These files
are compressed and should be FTP'ed in binary mode.
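
For readers who script their downloads, the following Python sketch shows one way to fetch a file from "blast/db" in binary mode using the standard ftplib module. The host and directory come from the text above; the file name "nr.Z" in the usage comment is only a placeholder, since the actual file names are not listed in this announcement, and the helper names are hypothetical.

```python
from ftplib import FTP


def fetch_binary(ftp, directory: str, filename: str, dest_path: str) -> int:
    """Download one file in binary mode over an open FTP connection.

    Returns the number of bytes written to dest_path.
    """
    ftp.cwd(directory)
    written = 0
    with open(dest_path, "wb") as out:
        def write_chunk(chunk: bytes) -> None:
            nonlocal written
            out.write(chunk)
            written += len(chunk)

        # retrbinary transfers in binary (image) mode, as required for
        # compressed files.
        ftp.retrbinary(f"RETR {filename}", write_chunk)
    return written


def download_blast_db(filename: str, dest_path: str) -> int:
    """Anonymous login to the NCBI FTP site, then fetch from blast/db."""
    with FTP("ncbi.nlm.nih.gov") as ftp:
        ftp.login()  # no arguments means an anonymous login
        return fetch_binary(ftp, "blast/db", filename, dest_path)


# Usage (placeholder file name, not taken from the announcement):
#   download_blast_db("nr.Z", "nr.Z")
```

Separating fetch_binary from the connection setup keeps the transfer logic testable without network access.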
A FASTA file ("genpept.fsa") containing all the proteins in the
GenBank release will also be available from the NCBI FTP site, in the directory
"genbank". The one-line headers in this FASTA file have the same format
as those presented in Appendix 1. Daily updates to this file are provided as
"gpcu.fsa", in the directory "genbank/daily". These files serve as replacements
for "genpept.fasta" and "gpcu.fasta", which will be discontinued on March 25, 1996.
Appendix 1: Sequence Identifier Syntax
The syntax of sequence header lines used by the NCBI BLAST server depends on
the database from which each sequence was obtained. The table below lists
the identifiers for the databases from which the sequences were derived.

  Database Name                      Identifier Syntax
  GenBank                            gb|accession|locus
  EMBL Data Library                  emb|accession|locus
  DDBJ, DNA Database of Japan        dbj|accession|locus
  NBRF PIR                           pir||entry
  Protein Research Foundation        prf||name
  SWISS-PROT                         sp|accession|entry name
  Brookhaven Protein Data Bank       pdb|entry|chain
  Kabat's Sequences of Immuno...     gnl|kabat|identifier
  GenInfo Backbone Id                bbs|number

For example, an identifier might be "gb|M73307|AGMA13GT", where the "gb" tag
indicates that the identifier refers to a GenBank sequence, "M73307" is its
GenBank ACCESSION, and "AGMA13GT" is the GenBank LOCUS.
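
The identifier syntax above lends itself to a small lookup-based parser. The Python sketch below is only an illustration: the field names are taken from the table (with "gb" taken from the example just given), the name "database" for the second field of "gnl" is an assumption, and parse_seq_id is a hypothetical helper rather than part of any BLAST client.

```python
# Field names for each tag; None marks a position the syntax leaves empty
# (for example the doubled bar in "pir||entry").
ID_FIELDS = {
    "gb":  ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp":  ("accession", "entry_name"),
    "pdb": ("entry", "chain"),
    "gnl": ("database", "identifier"),  # "database" is an assumed field name
    "bbs": ("number",),
}


def parse_seq_id(identifier: str) -> dict:
    """Split a bar-delimited sequence identifier into named fields."""
    tag, *values = identifier.split("|")
    if tag not in ID_FIELDS:
        raise ValueError(f"unknown identifier tag: {tag!r}")
    names = ID_FIELDS[tag]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} fields after {tag!r}: {identifier!r}")
    parsed = {"tag": tag}
    for name, value in zip(names, values):
        if name is not None:
            parsed[name] = value
    return parsed


print(parse_seq_id("gb|M73307|AGMA13GT"))
```

Tags that are not listed in the table raise a ValueError instead of being guessed at.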
"gi" identifiers are being assigned by NCBI for all sequences contained
within NCBI's sequence databases. The 'gi' identifier provides a uniform
and stable naming convention whereby a specific sequence is assigned
its unique gi identifier. If a nucleotide or protein sequence changes,
however, a new gi identifier is assigned, even if the accession number
of the record remains unchanged. Thus gi identifiers provide a mechanism
for identifying the exact sequence that was used or retrieved in a
For searches of the nr protein database where the sequences are derived
from conceptual translations of sequences from the nucleotide databases
the following syntax is used:
An example would be:
"gi|451623 (U04987) env [Simian immunodeficiency..."
where '451623' is the gi identifier and the 'U04987' is the accession
number of the nucleotide sequence from which it was derived.
Users are encouraged to use the '-gi' option for Blast output which will
produce a header line with the gi identifier concatenated with the database
identifier of the database from which it was derived, for example, from a
And similarly for protein databases:
Appendix 2: Examples of sequence header lines in Blast output:
gi|808969 (V00383) reading frame [Gallus gallus] 641 4.6e-99 2
gi|763101 (V00387) seventh exon [Gallus gallus] 690 2.6e-90 1
(note: gi numbers used for GenBank translated sequences; other protein sequences
are designated according to database of origin, e.g., Swiss-Prot, PDB, PRF).
sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED). ... 1191 3.0e-159 1
sp|P01014|OVAY_CHICK GENE Y PROTEIN (OVALBUMIN-RELATED). ... 949 2.7e-126 1
pdb|1OVA|A Ovalbumin (Egg Albumin) >pdb|1OVA|B ... 645 1.3e-99 2
prf||0705172A ovalbumin [Gallus gallus] 645 1.3e-99 2
gb|U37104|APU37104 Aethia pusilla cytochrome b gene, mi... 1672 1.2e-133 1
gb|U37087|ACU37087 Aethia cristatella cytochrome b gene... 1627 5.7e-133 2
emb|F19596|HSPD04201 H.sapiens mitochondrial EST sequence... 997 3.9e-77 1
emb|F19081|HSPD03679 H.sapiens mitochondrial EST sequence... 939 2.8e-72 1
gb|L44587|CALMTCYBF Callithrix emiliae (clones CEM 1, CE... 785 4.0e-59 1
gb|L44588|CALMTCYBFA Callithrix jacchus (clones CJA1, CJA... 695 1.5e-51 1
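
Post-processing programs often need to split header lines like those above into fields. The Python sketch below assumes that the three trailing columns are a score, a probability value, and a count; the announcement does not label these columns, so that reading is an assumption, as are the names HitSummary and parse_hit_line.

```python
from typing import NamedTuple


class HitSummary(NamedTuple):
    seq_id: str       # bar-delimited identifier, e.g. "gb|U37104|APU37104"
    description: str  # possibly truncated description text
    score: int        # assumed meaning of the first trailing column
    p_value: float    # assumed meaning of the second trailing column
    n: int            # assumed meaning of the third trailing column


def parse_hit_line(line: str) -> HitSummary:
    """Split one header line into identifier, description and three numbers."""
    # The last three whitespace-separated tokens are the numeric columns.
    head, score, p_value, n = line.strip().rsplit(None, 3)
    # The identifier is the first token; the rest is the description.
    seq_id, _, description = head.partition(" ")
    return HitSummary(seq_id, description.strip(), int(score), float(p_value), int(n))


hit = parse_hit_line(
    "gb|U37104|APU37104 Aethia pusilla cytochrome b gene, mi... 1672 1.2e-133 1"
)
print(hit.seq_id, hit.score)
```

Splitting from the right keeps the parser independent of how many words the description contains.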
Appendix 3: Technical details
The new databases may be searched with the existing ('experimental') client
by connecting to a port other than the default. The 'experimental' client
normally connects to port 5555 (service is "blast"). The new databases are
available by connecting to port 5559 (service is "xblast"). 'Experimental'
clients for UNIX, using "xblast", are available from the NCBI FTP site under
Blast2 clients search only the new databases and are available now on the NCBI
FTP site in blast/network/blast2.