Changes to the BLAST Databases

Tom Madden madden at corin.nlm.nih.gov
Tue Feb 20 11:38:57 EST 1996


			
			Changes to the BLAST Databases

				Feb. 20, 1996


This announcement describes a reorganization of the databases available for 
BLAST searches at the National Center for Biotechnology Information (NCBI).  
The same sequence data will be available for searching but will be organized 
for more efficient searching and will be better synchronized with the Entrez
databases.

The major differences will be the elimination of EST and STS sequences from
the 'nr' (non-redundant database) and the introduction of a database ('month')
containing only the sequences added over the past 30 days.  Another change 
is a new definition line for protein sequences.

WWW Blast and E-mail Blast users will switch to the new set of databases
beginning March 11, 1996.  Most users search 'nr', so the change should be
minimal: the database name will stay the same, but EST and STS sequences
will no longer be searched.

For users of Network Blast, a new client (Blast2) is being introduced that will
not only search the new set of databases, but also provide a better interface
for post-processing search results.  Blast2 represents the future direction
of the Blast service, and users of the existing Blast software, known as the
'Experimental' Blast service, are encouraged to upgrade to Blast2.  However,
the existing 'Experimental' Blast clients will be able to operate with the
new databases (see Appendix 2 for technical details).  Blast2 clients are
available now by FTP, and users of the 'Experimental' Blast clients are able
to use the new databases now.  Beginning March 11, 1996, the old databases
will no longer be available and both the Experimental and Blast2 clients will
use the new databases.

These changes are described further in the sections below:

   * New databases

   * Sequence identifiers

   * The Blast2 service 

   * A new Entrez-based e-mail server

   * Databases on the FTP site

	Comments about these changes are welcome; please send them to
blast-help at ncbi.nlm.nih.gov.  For information about other NCBI services,
send e-mail to info at ncbi.nlm.nih.gov.


===========================================================================

New Databases:

	Presently both the old and the new databases are available.  The 
old databases will be available until March 11, 1996, at which time only the 
new databases will be available.  The new databases are now searchable with
the Network version of Blast (see Appendix 2).

 New nucleotide databases:

 nr	Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no ESTs or STSs)
 est	Non-redundant Database of GenBank EST Division
 sts	Non-redundant Database of GenBank STS Division
 vector	Vector subset of GenBank
 mito	Database of mitochondrial sequences, Rel. 1.0, July 1995
 kabat	Kabat Sequences of Nucleic Acid of Immunological Interest
 month	All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last
	30 days

 New protein databases:

 nr	Non-redundant GenBank CDS translations+PDB+SwissProt+PIR
 spdb	Non-redundant SwissProt+PDB sequences
 kabat	Kabat Sequences of Proteins of Immunological Interest
 month	All new or revised GenBank CDS translation+PDB+SwissProt+PIR sequences 
	released in the last 30 days

===========================================================================

Sequence identifiers for the new databases:

	The one-line descriptions for GenBank conceptual translations will 
change.  The present descriptions describe the conceptual translation of a CDS 
in terms of the GenBank flatfile, but do not reliably point to a specific
CDS if the order or number of CDS features changes.  An example is:

        "gp|U04987|SIU04987_4   env gene product [Simian immunodef...";

"SIU04987_4" indicates that this protein is the fourth CDS on the entry with
the accession U04987.  Changes to the GenBank entry can change the order and 
number of CDS features.

Therefore, to identify a specific protein sequence, NCBI is now assigning
a stable identifier, called a 'gi', to all sequences.  A 'gi' is a unique
integer that changes when the sequence changes.  It does not change,
however, if only the features or references of an entry are updated.
The new format for protein sequences will contain the identifier 'gi'
followed by the 'gi number':

        "gi|451623           (U04987) env [Simian immunodeficiency..."

Although the accession number of the translated nucleotide sequence will 
appear in the header line (U04987 in the example above), retrieval by 
'gi number' is the only reliable method to locate the correct translated 
DNA sequence.  The new e-mail retriever (see below) or Entrez may be used 
to retrieve sequences identified by "gi".

Additional examples of definition lines are provided in Appendix 1.
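
As an illustration of the two formats described above, the following is a
minimal Python sketch that pulls the fields out of an old-style ('gp') or
new-style ('gi') definition line.  It is based only on the two examples
shown in this section; the names DefLine and parse_defline are hypothetical,
and other line layouts may exist that this sketch does not handle.

```python
import re
from typing import NamedTuple, Optional


class DefLine(NamedTuple):
    """Fields pulled from one protein definition line."""
    tag: str                   # "gi" (new format) or "gp" (old format)
    gi: Optional[int]          # gi number, new format only
    accession: str             # accession of the nucleotide entry
    cds_locus: Optional[str]   # e.g. "SIU04987_4", old format only
    description: str


# New format:  gi|451623           (U04987) env [Simian ...
NEW_RE = re.compile(r"^gi\|(\d+)\s+\((\w+)\)\s+(.*)$")
# Old format:  gp|U04987|SIU04987_4   env gene product [Simian ...
OLD_RE = re.compile(r"^gp\|(\w+)\|(\w+)\s+(.*)$")


def parse_defline(line: str) -> DefLine:
    """Parse one definition line; raise ValueError if it matches neither form."""
    line = line.strip()
    m = NEW_RE.match(line)
    if m:
        return DefLine("gi", int(m.group(1)), m.group(2), None, m.group(3))
    m = OLD_RE.match(line)
    if m:
        return DefLine("gp", None, m.group(1), m.group(2), m.group(3))
    raise ValueError(f"unrecognized definition line: {line!r}")
```

Note that only the new format yields a value (the gi number) that stays
tied to one specific sequence; the old "_4" suffix depends on CDS order.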


===========================================================================

The Blast2 service:

Blast2 is the newest version of the BLAST client software and represents
the foundation for NCBI's future development of the BLAST service. 
The Blast2 service permits BLAST searches with a number of
different clients for different platforms, available on the NCBI FTP site.
These clients can be obtained by FTP from ncbi.nlm.nih.gov (login as anonymous
and cd to blast/network/blast2).  In contrast to the present BLAST service
(designated "experimental"), these clients communicate with the BLAST server
through a structured interface, allowing BLAST to interface better with other
programs, e.g., post-processing programs.  The Blast2 service already uses the
new databases.
Although Blast2 is expected to eventually replace the 'experimental' Blast
clients, NCBI will continue to support the 'experimental' Blast client for
the near future.

===========================================================================

New Entrez-based e-mail retrieval server ("QUERY"):

QUERY uses the Entrez Query Engine to obtain data. Entrez can retrieve
data by domain (i.e., nucleotide or protein) rather than by source database.
QUERY can retrieve entries by "gi" (see above) and is synchronized with the 
new BLAST databases.  To receive documentation about this service, 
send an email to "query at ncbi.nlm.nih.gov".  The body of the message 
should consist of the word "help" (without quotes).
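
For users who script their mail, the help request described above might be
composed as in the following Python sketch.  It only builds the message and
does not send it.  The function name build_help_request and the sender
address are hypothetical, and the address is written here with "@" in place
of the " at " used in this announcement.

```python
from email.message import EmailMessage

# Address given in the announcement, with " at " written as "@".
QUERY_ADDRESS = "query@ncbi.nlm.nih.gov"


def build_help_request(sender: str) -> EmailMessage:
    """Build (but do not send) a message whose body is the single word help."""
    msg = EmailMessage()
    msg["From"] = sender          # hypothetical: supply your own address
    msg["To"] = QUERY_ADDRESS
    msg.set_content("help")       # body is the word help, without quotes
    return msg
```

The returned message can then be handed to whatever mail transport is
available locally.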

===========================================================================

Databases on the FTP site
	
	All the databases listed above are available as FASTA files from the
NCBI FTP site (ncbi.nlm.nih.gov).  These FASTA files are not necessary to 
perform BLAST searches using the BLAST clients discussed here.  They are 
only needed if one wishes to run the actual BLAST search engines in-house, 
rather than sending BLAST queries to the NCBI.  To obtain these files, FTP
to ncbi.nlm.nih.gov, login as anonymous and cd to "blast/db".  These files 
are compressed and should be FTP'ed in binary mode.
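
The retrieval steps above (anonymous login, cd to "blast/db", binary
transfer) can be sketched in Python with the standard ftplib module.  This
is only an illustration: the function names are hypothetical, no file names
are given in this announcement, and the host name is simply the one quoted
above.

```python
from ftplib import FTP

HOST = "ncbi.nlm.nih.gov"   # host named in the announcement
DB_DIR = "blast/db"         # directory named in the announcement


def fetch_database(ftp, filename, dest_path):
    """Download one file from blast/db in binary mode.

    `ftp` is an already logged-in ftplib.FTP (or compatible) object.
    Returns the number of bytes written to dest_path.
    """
    ftp.cwd(DB_DIR)
    written = 0
    with open(dest_path, "wb") as out:
        def write_chunk(chunk):
            nonlocal written
            out.write(chunk)
            written += len(chunk)
        # retrbinary transfers in binary (image) mode, as required
        # for compressed files.
        ftp.retrbinary(f"RETR {filename}", write_chunk)
    return written


def download(filename, dest_path, host=HOST):
    """Open an anonymous session, fetch one file, and close the session."""
    with FTP(host) as ftp:
        ftp.login()   # no arguments means anonymous login
        return fetch_database(ftp, filename, dest_path)
```

A call would then look like download(NAME, NAME), where NAME is the name of
one of the compressed files found in the directory listing.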
	
	A FASTA file ("genpept.fsa") containing all the proteins in the 
GenBank release will also be available from the NCBI FTP site, in the directory
"genbank".  This file will not be available until Feb. 26, 1996.  The one-line 
descriptions in this FASTA file have the same format as those presented in 
Appendix 1.  This file serves as a replacement for the file "genpept.fasta", 
which will be discontinued on March 11, 1996.

===========================================================================


Appendix 1: Examples of sequence header lines in Blast output:

Protein:

gi|808969            (V00383) reading frame [Gallus gallus]    641  4.6e-99   2
gi|763101            (V00387) seventh exon [Gallus gallus]     690  2.6e-90   1

(Note: gi numbers are used for GenBank translated sequences; other protein
sequences are designated according to database of origin, e.g., Swiss-Prot,
PDB, PRF.)

sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED). ...  1191  3.0e-159  1
sp|P01014|OVAY_CHICK GENE Y PROTEIN (OVALBUMIN-RELATED). ...   949  2.7e-126  1
pdb|1OVA|A           Ovalbumin (Egg Albumin) >pdb|1OVA|B ...   645  1.3e-99   2
prf||0705172A        ovalbumin [Gallus gallus]                 645  1.3e-99   2


Nucleotide:

gb|U37104|APU37104   Aethia pusilla cytochrome b gene, mi...  1672  1.2e-133  1
gb|U37087|ACU37087   Aethia cristatella cytochrome b gene...  1627  5.7e-133  2
emb|F19596|HSPD04201 H.sapiens mitochondrial EST sequence...   997  3.9e-77   1
emb|F19081|HSPD03679 H.sapiens mitochondrial EST sequence...   939  2.8e-72   1
gb|L44587|CALMTCYBF  Callithrix emiliae (clones CEM 1, CE...   785  4.0e-59   1
gb|L44588|CALMTCYBFA Callithrix jacchus (clones CJA1, CJA...   695  1.5e-51   1
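
Header lines like the ones in this appendix can be split into their parts
with a short Python sketch such as the following.  The column meanings are
an assumption: the three trailing numbers are not labeled in this
announcement and are treated here as a score, a probability, and a count.
The function name parse_hit_line and the dictionary keys are hypothetical.

```python
def parse_hit_line(line):
    """Split one header line into identifier parts and trailing numbers.

    Assumes the first whitespace-delimited token is the identifier and
    the last three tokens are numeric columns.  Runs of spaces inside the
    description are collapsed to single spaces.
    """
    tokens = line.split()
    if len(tokens) < 4:
        raise ValueError(f"too few columns: {line!r}")
    identifier = tokens[0]
    # The identifier is '|'-delimited; the first part names the database
    # of origin (gi, sp, pdb, prf, gb, emb in the examples above).
    db_tag, *fields = identifier.split("|")
    return {
        "db": db_tag,
        "fields": fields,
        "description": " ".join(tokens[1:-3]),
        "score": int(tokens[-3]),          # assumed meaning
        "probability": float(tokens[-2]),  # assumed meaning
        "n": int(tokens[-1]),              # assumed meaning
    }
```

For a 'gi' line the single field is the gi number; for 'prf' lines the
first field is empty because the identifier contains two adjacent bars.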




Appendix 2: Technical details

The new databases may be searched with the existing ('experimental') client
by connecting to a different port than the default.  The 'experimental' client
normally connects to port 5555 (service is "blast").  The new databases are
available by connecting to port 5559 (service is "xblast").  'Experimental'
clients for UNIX, using "xblast", are available from the NCBI FTP site under
"blast/network/experimental/unix".

Blast2 clients search only the new databases and are available now on the NCBI
FTP site in blast/network/blast2.
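
The service-name-to-port mapping described in this appendix can be written
down as in the following Python sketch.  It first asks the local services
database for the port and falls back to the numbers given above.  The
function name resolve_port is hypothetical, and the use of TCP is an
assumption, since this announcement does not name the transport protocol.

```python
import socket

# Port numbers as given in this appendix.
ANNOUNCED_PORTS = {"blast": 5555, "xblast": 5559}


def resolve_port(service, lookup=socket.getservbyname):
    """Return the port for a service name.

    Tries the local services database first (assuming TCP), then falls
    back to the announced port numbers.  Raises KeyError for a service
    name that is found in neither place.
    """
    try:
        return lookup(service, "tcp")
    except OSError:
        # No local entry for this service name.
        return ANNOUNCED_PORTS[service]
```

Passing the lookup function as a parameter keeps the sketch easy to check
on machines that have no entry for either service.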




