Changes to the BLAST Databases

Tom Madden madden at corin.nlm.nih.gov
Fri Mar 8 14:19:44 EST 1996


			
			Changes to the BLAST Databases

				March 8, 1996


This announcement describes a reorganization of the databases available for 
BLAST searches at the National Center for Biotechnology Information (NCBI).  
The same sequence data will be available for searching but will be organized 
for more efficient searching and will be better synchronized with the Entrez
databases.

The major differences will be the elimination of EST and STS sequences from
the 'nr' (non-redundant database) and the introduction of a database ('month')
containing only the sequences added over the past 30 days.  Another change 
is a new definition line for protein sequences.

WWW Blast and E-mail Blast users will switch to the new set of databases
beginning March 11, 1996.  Most users search 'nr', so the change should be
minimal: the database name will stay the same, but EST and STS sequences
will no longer be searched.

For users of Network Blast, a new client (Blast2) is being introduced that will
not only search the new set of databases, but also provide a better interface
for post-processing search results.  Blast2 represents the future direction
of the Blast service, and users of the existing Blast software, known as the
'Experimental' Blast service, are encouraged to upgrade to Blast2.  However,
the existing 'Experimental' Blast clients will be able to operate with the
new databases (see Appendix 3 for technical details).  Blast2 clients are
available now by FTP, and users of the 'Experimental' Blast clients are able
to use the new databases now.  Beginning March 11, 1996, the old databases
will no longer be available and both the Experimental and Blast2 clients will
use the new databases.

These changes are described further below in the following topics:

   * New databases

   * Sequence identifiers

   * The Blast2 service 

   * A new Entrez-based e-mail server

   * Databases on the FTP site

	Comments about these changes are welcome; please send them to
blast-help at ncbi.nlm.nih.gov.  For information about other NCBI services,
send e-mail to: info at ncbi.nlm.nih.gov


===========================================================================

New Databases:

	Presently both the old and the new databases are available.  The 
old databases will be available until March 11, 1996, at which time only the 
new databases will be available.  The new databases are now searchable with
the Network version of Blast (see Appendix 3).

 New nucleotide databases:

 nr	Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no ESTs or STSs)
 est	Non-redundant Database of GenBank+EMBL+DDBJ EST Division
 sts	Non-redundant Database of GenBank+EMBL+DDBJ STS Division
 pdb    PDB nucleotide sequences
 vector	Vector subset of GenBank
 mito	Database of mitochondrial sequences, Rel. 1.0, July 1995
 kabat	Kabat Sequences of Nucleic Acid of Immunological Interest
 epd	Eukaryotic Promoter Database
 alu	Select Alu Repeats from REPBASE
 month	All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last
	30 days

 New protein databases:

 nr	Non-redundant GenBank CDS translations+PDB+SwissProt+PIR
 pdb    PDB protein sequences
 spdb	Non-redundant SwissProt+PDB sequences
 kabat	Kabat Sequences of Proteins of Immunological Interest
 alu	Translations of Select Alu Repeats from REPBASE
 month	All new or revised GenBank CDS translation+PDB+SwissProt+PIR sequences 
	released in the last 30 days
 swissprot      SwissProt sequences

===========================================================================

Sequence identifiers for the new databases:

	The one-line descriptions for GenBank conceptual translations will 
change.  The present descriptions describe the conceptual translation of a CDS 
in terms of the GenBank flatfile, but do not reliably point to a specific
CDS if the order or number of CDS features changes.  An example is:

        "gp|U04987|SIU04987_4   env gene product [Simian immunodef...";

"SIU04987_4" indicates that this protein is the fourth CDS on the entry with
the accession U04987.  Changes to the GenBank entry can change the order and 
number of CDS features.

Therefore, to identify a specific protein sequence, NCBI is now assigning
a stable identifier, called a 'gi', to all sequences.  A 'gi' is a unique
integer that changes only when the sequence itself changes; it does not
change if only the features or references of an entry are updated.
The new format for protein sequences will contain the identifier 'gi'
followed by the 'gi number':

        "gi|451623           (U04987) env [Simian immunodeficiency..."

Although the accession number of the translated nucleotide sequence will 
appear in the header line (U04987 in the example above), retrieval by 
'gi number' is the only reliable method to locate the correct translated 
DNA sequence.  The new e-mail retriever (see below) or Entrez may be used 
to retrieve sequences identified by "gi".

An exhaustive list of sequence identifiers used in these new databases is 
provided in Appendix 1.  Additional examples of definition lines are
provided in Appendix 2.
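
As an illustration only, the new protein definition line can be split into
its gi number, the accession of the translated nucleotide entry, and the
description.  The Python sketch below assumes the layout shown in the example
above; the function name parse_gi_defline and the regular expression are
hypothetical and are not part of any NCBI software.

```python
import re

# Assumed layout of the new-style protein definition line:
#   gi|<number>   (<nucleotide accession>) <description>
_GI_DEFLINE = re.compile(r'^gi\|(\d+)\s+\(([^)]+)\)\s*(.*)$')


def parse_gi_defline(line):
    """Return (gi, accession, description), or None if the line does not match."""
    match = _GI_DEFLINE.match(line.strip().strip('"'))
    if match is None:
        return None
    gi, accession, description = match.groups()
    return int(gi), accession, description


print(parse_gi_defline('gi|451623           (U04987) env [Simian immunodeficiency...'))
```

An old-style line beginning with "gp|" does not match and yields None, which
makes it easy to tell the two formats apart while both are in circulation.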


===========================================================================

The Blast2 service:

Blast2 is the newest version of the BLAST client software and represents
the foundation for NCBI's future development of the BLAST service.
The Blast2 service permits BLAST searches with a number of different clients
for different platforms, all available on the NCBI FTP site.  These clients
can be obtained by FTP from ncbi.nlm.nih.gov (log in as anonymous and cd to
blast/network/blast2).  In contrast to the present BLAST service (designated
"experimental"), these clients communicate with the BLAST server through
a structured interface, allowing BLAST to interface better with other
programs, e.g., post-processing programs.  The Blast2 service already uses
the new databases.  Although Blast2 is expected to eventually replace the
'experimental' Blast clients, NCBI will continue to support the
'experimental' Blast client for the near future.

===========================================================================

New Entrez-based e-mail retrieve server ("QUERY"): 

QUERY uses the Entrez Query Engine to obtain data. Entrez can retrieve
data by domain (i.e., nucleotide or protein) rather than by source database.
QUERY can retrieve entries by "gi" (see above) and is synchronized with the 
new BLAST databases.  To receive documentation about this service, 
send an email to "query at ncbi.nlm.nih.gov".  The body of the message 
should consist of the word "help" (without quotes).
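
A minimal sketch of composing such a request in Python follows.  It only
builds the message and does not send it.  The sender address is a
placeholder, and the recipient is written with "@" on the assumption that
"query at ncbi.nlm.nih.gov" above is the archive's spelling of an ordinary
e-mail address.

```python
from email.message import EmailMessage


def build_query_help_message(sender):
    """Build (but do not send) a request for the QUERY documentation."""
    msg = EmailMessage()
    msg["From"] = sender                   # placeholder supplied by the caller
    msg["To"] = "query@ncbi.nlm.nih.gov"   # assumed form of the address above
    msg.set_content("help")                # body is the single word: help
    return msg


msg = build_query_help_message("user@example.org")
print(msg["To"])
print(msg.get_content().strip())
```

The resulting message can be handed to any mail transport (for example
smtplib) configured for the local site.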

===========================================================================

Databases on the FTP site
	
	All the databases listed above are available as FASTA files from the
NCBI FTP site (ncbi.nlm.nih.gov).  These FASTA files are not necessary to 
perform BLAST searches using the BLAST clients discussed here.  They are 
only needed if one wishes to run the actual BLAST search engines in-house, 
rather than sending BLAST queries to the NCBI.  To obtain these files, FTP
to ncbi.nlm.nih.gov, login as anonymous and cd to "blast/db".  These files 
are compressed and should be FTP'ed in binary mode.
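
The retrieval steps above can be sketched with Python's standard ftplib
module.  The host and directory are those given in this announcement; the
file name is left as a parameter because the individual file names in
"blast/db" are not listed here.  The helper names fetch_blast_db and download
are hypothetical.

```python
from ftplib import FTP


def fetch_blast_db(ftp, filename, local_path):
    """Download one file from blast/db over an already logged-in connection.

    `ftp` is expected to behave like ftplib.FTP after login().
    """
    ftp.cwd("blast/db")
    with open(local_path, "wb") as out:
        # retrbinary transfers in binary mode, as required for the
        # compressed database files.
        ftp.retrbinary("RETR " + filename, out.write)


def download(filename, local_path, host="ncbi.nlm.nih.gov"):
    """Log in anonymously, fetch one file, and close the connection."""
    ftp = FTP(host)
    ftp.login()  # ftplib logs in as "anonymous" when no user is given
    try:
        fetch_blast_db(ftp, filename, local_path)
    finally:
        ftp.quit()
```

Keeping the transfer in a function that accepts the connection object makes
it possible to exercise the logic without contacting the server.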
	
	A FASTA file ("genpept.fsa") containing all the proteins in the 
GenBank release will also be available from the NCBI FTP site, in the directory
"genbank".  The one-line headers in this FASTA file have the same format 
as those presented in Appendix 1.  Daily updates to this file are provided
as "gpcu.fsa" in the directory "genbank/daily".  These files serve as
replacements for "genpept.fasta" and "gpcu.fasta", which will be
discontinued on March 25, 1996.

===========================================================================


Appendix 1: Sequence Identifier Syntax

The syntax of sequence header lines used by the NCBI BLAST server depends on
the database from which each sequence was obtained.  The table below lists
the identifiers for the databases from which the sequences were derived.
 

  Database Name                     Identifier Syntax
  ============================      ========================
  GenBank                           gb|accession|locus
  EMBL Data Library                 emb|accession|locus
  DDBJ, DNA Database of Japan       dbj|accession|locus
  NBRF PIR                          pir||entry
  Protein Research Foundation       prf||name
  SWISS-PROT                        sp|accession|entry name
  Brookhaven Protein Data Bank      pdb|entry|chain
  Kabat's Sequences of Immuno...    gnl|kabat|identifier
  Patents                           pat|country|number 
  GenInfo Backbone Id               bbs|number 

 
For example, an identifier might be "gb|M73307|AGMA13GT", where the "gb" tag
indicates that the identifier refers to a GenBank sequence, "M73307" is its
GenBank ACCESSION, and "AGMA13GT" is the GenBank LOCUS.  
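
As a rough illustration, the table above can be turned into a small lookup
that names the fields of an identifier.  This Python sketch is not NCBI code:
the names ID_FIELDS and parse_identifier are hypothetical, and the field name
"database" for the middle part of "gnl|kabat|identifier" is an assumption.

```python
# Field names for each database tag, following the table above.
# None marks a field that the syntax leaves empty (e.g. "pir||entry").
ID_FIELDS = {
    "gb": ("accession", "locus"),
    "emb": ("accession", "locus"),
    "dbj": ("accession", "locus"),
    "pir": (None, "entry"),
    "prf": (None, "name"),
    "sp": ("accession", "entry name"),
    "pdb": ("entry", "chain"),
    "gnl": ("database", "identifier"),
    "pat": ("country", "number"),
    "bbs": ("number",),
}


def parse_identifier(identifier):
    """Split an identifier such as 'gb|M73307|AGMA13GT' into named fields."""
    tag, *values = identifier.split("|")
    names = ID_FIELDS.get(tag)
    if names is None or len(values) != len(names):
        raise ValueError("unrecognized identifier: %r" % identifier)
    fields = {"tag": tag}
    for name, value in zip(names, values):
        if name is not None:
            fields[name] = value
    return fields


print(parse_identifier("gb|M73307|AGMA13GT"))
```

Identifiers with an unknown tag or the wrong number of fields raise
ValueError rather than being guessed at.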

"gi" identifiers are being assigned by NCBI for all sequences contained
within NCBI's sequence databases.  The 'gi' identifier provides a uniform
and stable naming convention whereby a specific sequence is assigned
its unique gi identifier.  If a nucleotide or protein sequence changes,
however, a new gi identifier is assigned, even if the accession number
of the record remains unchanged. Thus gi identifiers provide a mechanism
for identifying the exact sequence that was used or retrieved in a
given search.

For searches of the nr protein database, where the sequences are derived
from conceptual translations of sequences from the nucleotide databases,
the following syntax is used:

                     gi|gi_identifier

An example would be:

        gi|451623           (U04987) env [Simian immunodeficiency...

where '451623' is the gi identifier and the 'U04987' is the accession
number of the nucleotide sequence from which it was derived.

Users are encouraged to use the '-gi' option for Blast output, which will
produce a header line with the gi identifier concatenated with the
identifier of the database from which the sequence was derived, for example,
from a nucleotide database:

        gi|176485|gb|M73307|AGMA13GT

And similarly for protein databases: 

        gi|129295|sp|P01013|OVAX_CHICK
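
A concatenated identifier of this kind can be separated into the gi number
and the source-database identifier with a single split.  The Python sketch
below assumes the "gi|number|rest" layout of the two examples above; the
function name split_gi_identifier is hypothetical.

```python
def split_gi_identifier(identifier):
    """Split 'gi|<number>|<source identifier>' into (gi, source identifier)."""
    parts = identifier.split("|", 2)  # at most three pieces
    if len(parts) != 3 or parts[0] != "gi" or not parts[1].isdigit():
        raise ValueError("not a concatenated gi identifier: %r" % identifier)
    return int(parts[1]), parts[2]


print(split_gi_identifier("gi|176485|gb|M73307|AGMA13GT"))
print(split_gi_identifier("gi|129295|sp|P01013|OVAX_CHICK"))
```

The second element is left untouched, so it can still be interpreted
according to the identifier syntax of its source database.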



Appendix 2: Examples of sequence header lines in Blast output:

Protein:

gi|808969            (V00383) reading frame [Gallus gallus]    641  4.6e-99   2
gi|763101            (V00387) seventh exon [Gallus gallus]     690  2.6e-90   1

(note: gi numbers used for GenBank translated sequences; other protein sequences
are designated according to database of origin, e.g., Swiss-Prot, PDB, PRF).

sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED). ...  1191  3.0e-159  1
sp|P01014|OVAY_CHICK GENE Y PROTEIN (OVALBUMIN-RELATED). ...   949  2.7e-126  1
pdb|1OVA|A           Ovalbumin (Egg Albumin) >pdb|1OVA|B ...   645  1.3e-99   2
prf||0705172A        ovalbumin [Gallus gallus]                 645  1.3e-99   2


Nucleotide:

gb|U37104|APU37104   Aethia pusilla cytochrome b gene, mi...  1672  1.2e-133  1
gb|U37087|ACU37087   Aethia cristatella cytochrome b gene...  1627  5.7e-133  2
emb|F19596|HSPD04201 H.sapiens mitochondrial EST sequence...   997  3.9e-77   1
emb|F19081|HSPD03679 H.sapiens mitochondrial EST sequence...   939  2.8e-72   1
gb|L44587|CALMTCYBF  Callithrix emiliae (clones CEM 1, CE...   785  4.0e-59   1
gb|L44588|CALMTCYBFA Callithrix jacchus (clones CJA1, CJA...   695  1.5e-51   1
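
For post-processing, header lines like those above can be split into the
identifier, the description, and the three trailing numeric columns.  This
announcement does not label those columns; the Python sketch below assumes
they are a score, a probability, and a count N, and the key names it uses
are hypothetical.

```python
def parse_hit_line(line):
    """Split one header line of Blast output into its columns.

    Assumes the last three whitespace-separated tokens are numeric and
    that the first token is the sequence identifier.
    """
    head, score, prob, n = line.strip().rsplit(None, 3)
    identifier, _, description = head.partition(" ")
    return {
        "id": identifier,
        "description": description.strip(),
        "score": int(score),   # assumed meaning of the first number
        "p_value": float(prob),  # assumed meaning of the second number
        "n": int(n),           # assumed meaning of the third number
    }


hit = parse_hit_line(
    "gi|808969            (V00383) reading frame [Gallus gallus]    641  4.6e-99   2"
)
print(hit["id"], hit["score"], hit["n"])
```

Splitting from the right keeps descriptions intact even when they contain
spaces, brackets, or further identifiers such as ">pdb|1OVA|B".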



Appendix 3: Technical details

The new databases may be searched with the existing ('experimental') client
by connecting to a port other than the default.  The 'experimental' service
normally uses port 5555 (service is "blast").  The new databases are
available by connecting to port 5559 (service is "xblast").  'Experimental'
clients for UNIX, using "xblast",  are available from the NCBI FTP site under
"blast/network/experimental/unix".

Blast2 clients search only the new databases and are available now on the NCBI
FTP site in blast/network/blast2.
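
For scripts that need to choose between the two 'experimental' services, the
service names and ports given in this appendix can be kept in a small table.
The Python sketch below is illustrative only; the names BLAST_SERVICES and
port_for are hypothetical, and nothing here opens a network connection.

```python
# Service names and port numbers as stated in this appendix.
BLAST_SERVICES = {
    "blast": 5555,   # default 'experimental' service
    "xblast": 5559,  # 'experimental' service for the new databases
}


def port_for(service):
    """Return the port number for a named 'experimental' Blast service."""
    try:
        return BLAST_SERVICES[service]
    except KeyError:
        raise ValueError("unknown Blast service: %r" % service) from None


print(port_for("xblast"))
```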





