estimating K and Lambda from an extreme value distribution
Gordon D. Pusch
g_d_pusch_remove_underscores at xnet.com
Thu Mar 4 12:57:28 EST 2004
Kevin Karplus <karplus at cheep.cse.ucsc.edu> writes:
> In article <giu11czfui.fsf at pusch.xnet.com>, Gordon D. Pusch wrote:
>> Just to stir the pot a little about the near-universal abuse of extreme value
>> theory that routinely occurs in bioinformatics: Since a "random sequence"
>> model underlies the derivation of the so-called "Karlin-Altschul distribution"
>> used by BLAST (whose correct name is the "Gumbel distribution," since
>> Gumbel discovered it and the other two asymptotic classes of extreme value
>> distribution decades before Karlin and Altschul), should not this exact
>> same objection also be equally true of the "standard" P-values returned
>> by BLAST --- which everyone still uses on a routine basis ??? >:-I
>
> The derivation of the Gumbel distribution may not be rock
> solid---indeed there are problems with the null model assumptions used
> in BLAST, but the authors of BLAST have continued to improve the
> composition and length corrections, and have good empirical evidence
> that the Gumbel distribution is a good fit to their scores.
...Except for the minor detail that, if they had bothered to actually
_read_ Gumbel's 1958 monograph "Statistics of Extremes," they would have
(or at least _should_ have) immediately realized that they had committed
=THE= classic Cardinal Sin of Extreme Value Theory, namely: Using An EVD
Corresponding To The Wrong Class Of Asymptotic Behavior For The PDF's Tail.
There is not one, but _THREE_ different "Extreme Value" distributions:
The Frechet Distribution, corresponding to the distribution of the extreme
value of a set of IIDRVs drawn from a PDF with a power-law tail, the Gumbel
Distribution, corresponding to the distribution of the extreme value of a
set of IIDRVs drawn from a PDF with an exponential tail, and the Weibull
Distribution, corresponding to the distribution of the extreme value
of a set of IIDRVs drawn from a PDF whose domain is bounded from above.
(NOTE: The derivation of the three families of Extreme-Value Distributions
assumes that the IIDRVs are _REAL-VALUED_ variables, whereas it is a common
practice in bioinformatics to artificially _force_ the scores to be integer
valued; in the discussion below, I will ignore this technical point,
since IMO it does not change my primary conclusion.)
Now, in any problem of bioinformatic interest, one is aligning a query
sequence with a FINITE number of characters against a database of sequences
all of which also have a FINITE number of characters; furthermore, each
character in the query sequence can be aligned with at _MOST_ one character
in each database sequence. Hence, if all the entries in the scoring matrix
are finite, and all the "gap penalties" are finite, then it is trivially
obvious that the score of the alignment must be a sum of a finite number
of finite terms, less a finite number of finite gap penalties, and therefore
must be finite. Furthermore, a small amount of additional thought will show
that the alignment score cannot possibly exceed the smaller of the two
scores obtained by aligning the query sequence with itself and the database
sequence with itself. It therefore follows that the set of scores obtained
by aligning a query sequence with every sequence in _ANY_ sequence database
is bounded from above, with an upper bound given by score of the alignment
of the query sequence with itself. Since the distribution of possible
scores is bounded from above, it therefore immediately and trivially follows
from extreme value theory that the correct extreme value distribution
is the _WEIBULL_ distribution, =NOT= the Gumbel distribution. The Gumbel
distribution, by contrast, would require a distribution of possible scores
that is _UNBOUNDED_, with an exponential tail; this makes absolutely no sense
unless one assumes that =BOTH= the query =AND= the database sequence can in
principle be _INFINITELY LONG_ with a geometric (exponential) length
distribution, and that one is _IGNORANT_ of the query sequence length and
composition at the time of the query --- which is patently absurd!
The error committed by Karlin and Altschul is that, being Frequentists,
the only way they could think of to treat the alignment problem was to
(incorrectly!) map it onto the alignment of a _SEMI-INFINITE SEQUENCE
WITH ITSELF_, which Karlin had earlier solved using "Martingale" theory.
However, contrary to Altschul's invalid hand-waving claims at
<http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html>
where he attempts to argue that the assumption of a "semi-infinite"
sequence can be justified by imagining that all the sequences in the
database are "concatenated end to end," in fact it is _FALSE_ to assume
that the limit that the number of sequences in the database becomes
unboundedly large is equivalent to assuming that the _length_ of the
query and database sequences become unboundedly large. The Karlin-Altschul
distribution provided the `answer' to the _WRONG STATISTICAL QUESTION_:
It is a good example of what happens when all one has is a hammer,
and then proceeds to falsely assume that "Everything Is A Nail."
> A good paper to read is
> @article{improved-psiblast-2001,
> author={Sch\"affer, Alejandro A. and Aravind, L.
> and Madden, Thomas L. and Shavirin, Sergei
> and Spouge, John L. and Wolf, Yuri I.
> and Koonin, Eugene and Altschul, Stephen F.},
> title ="Improving the accuracy of {PSI-BLAST} protein database
> searches with composition-based statistics and other refinements",
> journal="Nucleic Acids Research",
> volume=29, number=14,
> year=2001,
> pages="2994-3005"
> }
...Which unfortunately _still_ is solving the wrong statistical problem,
since it is implicitly assuming that one knows _NOTHING_ about the properties
of the query sequence, but is instead comparing pairs of sequences drawn
"at random" !!!
> Of course, the real reason that people routinely use the BLAST
> e-values is not because the auhtors of BLAST have been very careful to
> make the e-values as accurate as they can (although they have been
> careful), but because many biologists have blind faith in their
> computational tools.
...Or at least, in the _authors_ of their computational tools... >:-I
Furthermore, in my experience, many of the biologists I've worked with
do not even assume that the E-values or P-values return by such tools are
"statistically accurate" --- their level of statistical sophistication
stops at the level of "Small P-values Are Good, and Big P-values Are bad."
(Indeed, one biologist I worked with did not even =LOOK= at the P-values
or the E-values returned by BLAST or FASTA, but simply assumed that
any sequence "near the top of the list" was "closely related," whereas
any sequence near the bottom of the list was "probably unrelated" ---
in spite of having been _REPEATEDLY_ told that the set of hits in this
particular display had been "truncated" to show _ONLY THE BEST HITS
FROM EACH ORGANISM IN THE NON-REDUNDANT DATABASE_, and that in principle
it was possible for =ALL= the hits in such a "truncated" display to be
"significant"!
> One would expect wet-lab scientists to have a healthy scepticism of any
> results, knowing how often experiments fail, and how much bad data has
> made it out into the literature, but many seem to have an almost mystical
> faith in anything produced by computation.
This sort of "mystical faith in the results of computation" is not limited
to "wet-lab" biologists: I saw much the same sort of "mystical faith"
exhibited by _experimental physicists_, toward the results of the
"Monte Carlo" simulations they made of the "expected backgrounds"
of their high-energy and nuclear physics experiments !!! 8-(
> (On the other hand, computational people seem to have an almost mystical
> faith in wet-lab verification---expecting experiments to be neat, quick
> deterministic tests like "if" statements in code.)
<*shrug*> Programmers tend to think "deterministically," not "experimentally."
(What fraction of the programmers that you know also have training in
Experimental Analysis --- or even in basic statistical theory ???
Now consider that, given your research subfield, the odds are that this
fraction is _MUCH_ larger than the fraction among programmers in general...)
-- Gordon D. Pusch
perl -e '$_ = "gdpusch\@NO.xnet.SPAM.com\n"; s/NO\.//; s/SPAM\.//; print;'
More information about the Comp-bio
mailing list