estimating K and Lambda from an extreme value distribution

Kevin Karplus karplus at cheep.cse.ucsc.edu
Mon Mar 8 13:04:18 EST 2004


In article <giu117ygdn.fsf at pusch.xnet.com>, Gordon D. Pusch wrote:
> ...Except for the minor detail that, if they had bothered to actually
> _read_ Gumbel's 1958 monograph "Statistics of Extremes," they would have 
> (or at least _should_ have) immediately realized that they had committed
>=THE= classic Cardinal Sin of Extreme Value Theory, namely: Using An EVD
> Corresponding To The Wrong Class Of Asymptotic Behavior For The PDF's Tail. 
> 
> There is not one, but _THREE_ different "Extreme Value" distributions: 
> The Frechet Distribution, corresponding to the distribution of the extreme
> value of a set of IIDRVs drawn from a PDF with a power-law tail, the Gumbel
> Distribution, corresponding to the distribution of the extreme value of a
> set of IIDRVs drawn from a PDF with an exponential tail, and the Weibull
> Distribution, corresponding to the distribution of the extreme value
> of a set of IIDRVs drawn from a PDF whose domain is bounded from above.
> (NOTE: The derivation of the three families of Extreme-Value Distributions 
> assumes that the IIDRVs are _REAL-VALUED_ variables, whereas it is a common
> practice in bioinformatics to artificially _force_ the scores to be integer
> valued; in the discussion below, I will ignore this technical point,
> since IMO it does not change my primary conclusion.)
> 
> Now, in any problem of bioinformatic interest, one is aligning a query
> sequence with a FINITE number of characters against a database of sequences
> all of which also have a FINITE number of characters; furthermore, each
> character in the query sequence can be aligned with at _MOST_ one character
> in each database sequence. Hence, if all the entries in the scoring matrix 
> are finite, and all the "gap penalties" are finite, then it is trivially
> obvious that the score of the alignment must be a sum of a finite number 
> of finite terms, less a finite number of finite gap penalties, and therefore
> must be finite. Furthermore, a small amount of additional thought will show
> that the alignment score cannot possibly exceed the smaller of the two
> scores obtained by aligning the query sequence with itself and the database
> sequence with itself. It therefore follows that the set of scores obtained
> by aligning a query sequence with every sequence in _ANY_ sequence database
> is bounded from above, with an upper bound given by score of the alignment
> of the query sequence with itself.  Since the distribution of possible
> scores is bounded from above, it therefore immediately and trivially follows 
> from extreme value theory that the correct extreme value distribution 
> is the _WEIBULL_ distribution, =NOT= the Gumbel distribution. The Gumbel
> distribution, by contrast, would require a distribution of possible scores
> that is _UNBOUNDED_, with an exponential tail; this makes absolutely no sense
> unless one assumes that =BOTH= the query =AND= the database sequence can in
> principle be _INFINITELY LONG_ with a geometric (exponential) length
> distribution, and that one is _IGNORANT_ of the query sequence length and
> composition at the time of the query --- which is patently absurd!

I think I see your argument, but I'm very confused by it.  I was under
the distinct impression that a Gumbel distribution was a log of a
Weibull distribution.  (See, for example,
http://www.xycoon.com/Weibull.htm) Neither one is finite, but one is
bounded on one side by 0.  The scores of alignments fit the Gumbel
assumptions much better than they do the Weibull assumptions.  I
believe that the tails of the Weibull distribution do no fit the
empirically determined scoring distributions very well.  Since we are
mainly interested in using the models to extrapolate significance way
out beyond where it can be reliably measured, this lack of fit is
sufficient reason not to use Weibull distributions.

> 
> The error committed by Karlin and Altschul is that, being Frequentists, 
> the only way they could think of to treat the alignment problem was to
> (incorrectly!) map it onto the alignment of a _SEMI-INFINITE SEQUENCE 
> WITH ITSELF_, which Karlin had earlier solved using "Martingale" theory.
> However, contrary to Altschul's invalid hand-waving claims at
><http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html> 
> where he attempts to argue that the assumption of a "semi-infinite"
> sequence can be justified by imagining that all the sequences in the
> database are "concatenated end to end," in fact it is _FALSE_ to assume 
> that the limit that the number of sequences in the database becomes
> unboundedly large is equivalent to assuming that the _length_ of the 
> query and database sequences become unboundedly large. The Karlin-Altschul
> distribution provided the `answer' to the _WRONG STATISTICAL QUESTION_: 
> It is a good example of what happens when all one has is a hammer, 
> and then proceeds to falsely assume that "Everything Is A Nail."

The problem of finite-length query sequences was not ignored by Karlin
and Altschul.  Indeed, BLAST has a correction to the E-value
computation based on query sequence length and database sequence
length.  This correction turns out to be insignificant unless one is
using very short sequences (which some people do).

>> A good paper to read is 
>> @article{improved-psiblast-2001,
>> 	author={Sch\"affer, Alejandro A.  and Aravind, L. 
>> 		and Madden, Thomas L. and Shavirin, Sergei
>> 		and Spouge, John L. and Wolf, Yuri I.
>> 		and Koonin, Eugene and Altschul, Stephen F.},
>> 	title ="Improving the accuracy of {PSI-BLAST} protein database
>> 	searches with composition-based statistics and other refinements",
>> 	journal="Nucleic Acids Research",
>> 	volume=29, number=14,
>> 	year=2001,
>> 	pages="2994-3005"
>> 	}
> 
> ...Which unfortunately _still_ is solving the wrong statistical problem,
> since it is implicitly assuming that one knows _NOTHING_ about the properties
> of the query sequence, but is instead comparing pairs of sequences drawn 
> "at random" !!!

One can calibrate differently for each query---indeed I do that for my
fold-recognition server, but it is quite expensive.  The approach of
assuming that the query is randomly drawn makes it possible to
calibrate once for all queries, which is an important concern if you
are putting together a program to do millions of queries.  (I only
have to handle up to 1000 queries a week, so can afford the much more
expensive option of calibrating for each query separately.)


-- 
Kevin Karplus 	karplus at soe.ucsc.edu	http://www.soe.ucsc.edu/~karplus
life member (LAB, Adventure Cycling, American Youth Hostels)
Effective Cycling Instructor #218-ck (lapsed)
Professor of Biomolecular Engineering, University of California, Santa Cruz
Undergraduate and Graduate Director, Bioinformatics
Affiliations for identification only.




More information about the Comp-bio mailing list