problem with blast

Warren Gish gish at host.nlm.nih.gov
Fri Jan 22 11:07:30 EST 1993


In article <1993Jan21.231458.19512 at medmail.stanford.edu> wnelson at cmgm.stanford.edu (Will Nelson) writes:
>I have been having a problem with blastn.
>The problem is that on successive invocations of blastn,
>I get different results, using the same input sequence.
>
>My input file is this:
>
>
>>DROSATA - LOCUS       DROSATA       254 bp ds-DNA             INV       15-MAR-1989
>canatttgcaaatttaatgaaccccccttcaaaaaatgcgaaaattaacgcaaaaattgatttccctaaa
>tccttcaaaaagtaaataacaactttttggcaaaatctgattccctaatttcggtcattaaataatcagt
>ttttttgccacaactttaaaaataattgtctgaatatggaatgtcatacctcgcnnagctngtaattaaa
>tttccaatgaaactgtgttcaacaatgaaaattacatttttcgg
>

Dear Will,

What you have observed is a consequence of the ambiguous 'n' letters present
in the query sequence.  An analogous phenomenon can also arise when a database
sequence contains ambiguity codes.  BLASTN searches a compressed form of the
database and, to parallel this, it also uses a compressed form of the query.
In compressed form, letters other than A, C, G, and T are not permitted in the
sequences.  What BLASTN does with Ns is replace them with random selections
from the set {A,C,G,T}.  For the other IUB ambiguity codes, random selections
are made from the appropriate subset of {A,C,G,T}.  For example, any Rs would
be replaced by random selections from the set {A,G}.
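
The replacement step described above can be sketched in a few lines of Python.
This is only an illustration, not BLASTN's own code: the function name
resolve_ambiguities and the table name IUB_CODES are made up for this sketch,
and the table is the standard IUB nucleotide code set.

```python
import random

# Standard IUB nucleotide ambiguity codes and the bases each one stands for.
# (Table and function names are hypothetical, chosen for this sketch.)
IUB_CODES = {
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def resolve_ambiguities(seq, rng=None):
    """Replace each ambiguity code with a random base from its subset."""
    rng = rng or random.Random()
    out = []
    for ch in seq.upper():
        if ch in IUB_CODES:
            out.append(rng.choice(IUB_CODES[ch]))  # random pick, e.g. R -> A or G
        else:
            out.append(ch)  # unambiguous letters pass through unchanged
    return "".join(out)

# A fragment of the query around its three Ns; each call may differ.
fragment = "CATACCTCGCNNAGCTNGTAATT"
for _ in range(3):
    print(resolve_ambiguities(fragment))
```

Because a fresh random choice is made on every call, two runs over the same
input can yield different ACGT-only sequences, which is the root of the
run-to-run differences.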

As you may know, the alignments found by BLASTN can be scored by counting the
number of matches and mismatches, multiplying these two numbers by the
corresponding match reward (default value +5) and mismatch penalty (default
value -4), and adding them together.  Depending on the random replacements that
were made at each position of ambiguity, alignments found in different
invocations of BLASTN may have different initial scores; and/or the alignments
may have different start- and end-points in the query and database sequences.
(An alignment is not supposed to begin or end on a mismatch, as might be
encountered where a random replacement was made.)
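
As a small sketch of the scoring arithmetic just described, assuming the
default reward and penalty quoted above (the function name hsp_score and the
counts used are made up for illustration):

```python
def hsp_score(matches, mismatches, reward=5, penalty=-4):
    """Score an ungapped alignment from its match and mismatch counts."""
    return matches * reward + mismatches * penalty

# Hypothetical 20-base alignment in which one position is an N in the query.
# If the random replacement happens to mismatch the database base:
unlucky = hsp_score(matches=18, mismatches=2)
# If the random replacement happens to match it:
lucky = hsp_score(matches=19, mismatches=1)

print(unlucky, lucky, lucky - unlucky)
```

Each ambiguous position can therefore move the initial score by
reward - penalty = 9 points from one invocation to the next.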

After the database search is finished and the one-line descriptions are
reported, the alignments themselves are then reported.  It is at this point
that the original query and database sequences, including any ambiguity codes
that may be present, are used by BLASTN to re-score the alignments.  When a
final score is different from the initial score, its value is flagged with an
asterisk pointing to the WARNING footnote that will appear at the end of BLASTN
output.  (The initial alignment is not trimmed, however, should it
subsequently be found to begin or end with one or more mismatches.)

Pruned example:

                                                                     Smallest
                                                                     Poisson
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N
 
DROSAT353  D.melanogaster 1.688 g/ml satellite DNA sequence.   438  1.4e-27   1


>DROSAT353 D.melanogaster 1.688 g/ml satellite DNA sequence.
           Length = 353
 
  Plus Strand HSPs:
 
 Score = 429* (118.5 bits), Expect = 8.1e-27, P = 1.4e-27
 Identities = 97/111 (87%), Positives = 97/111 (87%), Strand = Plus
 
Query:   141 TTTTTTGCCACAACTTTAAAAATAATTGTCTGAATATGGAATGTCATACCTCGCNNAGCT 200
             |||| ||| ||||||||||||| ||||||||||||||||||  |||||| ||||  ||||
Sbjct:    62 TTTTCTGCTACAACTTTAAAAACAATTGTCTGAATATGGAAACTCATACGTCGCTGAGCT 121
 
Query:   201 NGTAATTAAATTTCCAATGAAACTGTGTTCAACAATGAAAATTACATTTTT 251
              ||||||||||||||||| ||||||||||||| |||| |||||| |||| |
Sbjct:   122 CGTAATTAAATTTCCAATCAAACTGTGTTCAAAAATGGAAATTAAATTTCT 172
 
<stuff deleted>

WARNING:  *12 alignments contained non-ACGT(U) letters.
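
The final score in the example can be checked by hand. The sketch below
re-scores the alignment shown above from the displayed query and subject
lines, assuming the default +5/-4 values and assuming an N counts as a
mismatch against any base (which is what the reported identities suggest).
The function name rescore is made up for this sketch.

```python
MATCH_REWARD = 5
MISMATCH_PENALTY = -4

# The two aligned segments, copied from the example output above.
query = (
    "TTTTTTGCCACAACTTTAAAAATAATTGTCTGAATATGGAATGTCATACCTCGCNNAGCT"
    "NGTAATTAAATTTCCAATGAAACTGTGTTCAACAATGAAAATTACATTTTT"
)
sbjct = (
    "TTTTCTGCTACAACTTTAAAAACAATTGTCTGAATATGGAAACTCATACGTCGCTGAGCT"
    "CGTAATTAAATTTCCAATCAAACTGTGTTCAAAAATGGAAATTAAATTTCT"
)

def rescore(q, s):
    """Count identities and score an ungapped alignment letter by letter."""
    if len(q) != len(s):
        raise ValueError("aligned segments must have equal length")
    matches = sum(1 for a, b in zip(q, s) if a == b)  # N never equals A/C/G/T
    mismatches = len(q) - matches
    score = matches * MATCH_REWARD + mismatches * MISMATCH_PENALTY
    return matches, mismatches, score

matches, mismatches, score = rescore(query, sbjct)
print(f"Identities = {matches}/{len(query)}, Score = {score}")
# prints "Identities = 97/111, Score = 429"
```

The initial score of 438 on the one-line description is 9 points higher,
which is what one would expect if exactly one of the three Ns had been
randomly replaced by a base that matched the database sequence.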



While this behavior is described in the BUGS section of the BLAST manual page,
the emphasis in that document is on potential cases where the initial score
satisfies the cutoff for reporting matches but the final score falls below the
cutoff.  In the worst case, completely different sets of matches can be
reported in different invocations of BLASTN when ambiguity codes are present.

Sincerely,
--Warren



