estimating K and Lambda from an extreme value distribution

Gordon D. Pusch g_d_pusch_remove_underscores at xnet.com
Fri Feb 27 17:51:22 EST 2004


Kevin Karplus <karplus at cheep.cse.ucsc.edu> writes:

> In article <gi4qtl7tg2.fsf at pusch.xnet.com>, Gordon D. Pusch wrote:
>> ranjeeva_r at yahoo.com (Ranjeeva) writes:
>> 
>>> I'm trying to fit a set of scores I get from searching a database of
>>> 1000 amino acid sequences with an HMM. I want to calculate a p-value
>>> for each matching score. My questions are
>>> 
>>> a) How do you estimate the scaling factors K and Lambda to fit my
>>> scores (1000) to an extreme value distribution?
>> 
>> The obvious question would be: Why would you bother, since an HMM _directly_
>> yields a generative probability estimate?  Simply compare the HMM probability
>> estimate to that of a "fiducial model," e.g., the "random sequence" model. 
>> 
>> However, if you _insist_ on (ab)using extreme-value theory for this problem,
>> googling on the exact phrase "extreme value distribution" plus "fitting"
>> yields 1,740 hits.
> 
> We used to use the log P(seq|HMM)/P(seq|null) scores in the SAM HMM
> program, but found that they were not as useful as we would have
> liked, mainly because the null models are so poor. Proteins are not
> well modeled as random sequences, so some HMMs have a systematic bias,
> as many proteins get partial matches.  (For example, many proteins
> include amphipathic helices, which will provide uninteresting partial
> matches to many HMMs.)

Just to stir the pot a little about the near-universal abuse of extreme value
theory in bioinformatics: a "random sequence" model also underlies the
derivation of the so-called "Karlin-Altschul distribution" used by BLAST.
(Its correct name is the "Gumbel distribution," since Gumbel discovered it
and the other two asymptotic classes of extreme value distribution decades
before Karlin and Altschul.)  Should not the same objection therefore apply
equally to the "standard" P-values returned by BLAST, which everyone still
uses on a routine basis?  >:-I
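
For the fitting question (a) itself, here is a minimal sketch, not anything
taken from SAM or BLAST. It assumes the scores really are Gumbel-distributed
and uses a method-of-moments fit rather than maximum likelihood; the function
names and the `search_space` argument are purely illustrative. In the
Karlin-Altschul parameterization, P(S >= x) = 1 - exp(-K N exp(-lambda x)),
so lambda is the inverse of the Gumbel scale and the location is
mu = ln(K N) / lambda, where N is the search-space size.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Method-of-moments fit of a Gumbel (maximum) distribution.

    Returns (mu, lam): mu is the location, lam = 1/scale plays the
    role of Lambda.  Assumes the scores are Gumbel-distributed.
    """
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6.0) / math.pi   # Gumbel scale from the std. dev.
    mu = mean - EULER_GAMMA * beta         # Gumbel location from the mean
    return mu, 1.0 / beta


def gumbel_pvalue(score, mu, lam):
    """P(S >= score) = 1 - exp(-exp(-lam * (score - mu)))."""
    # expm1 keeps precision when the p-value is very small.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def k_from_location(mu, lam, search_space):
    """Solve mu = ln(K * search_space) / lam for K.

    What to use as search_space (e.g. model length times database
    length) is an assumption the caller has to supply.
    """
    return math.exp(lam * mu) / search_space
```

Typical use would be `mu, lam = fit_gumbel_moments(scores)` followed by
`gumbel_pvalue(s, mu, lam)` for each score s. With only 1000 scores, and with
true homologs contaminating the upper tail, the estimates can be noticeably
off, so a censored or maximum-likelihood fit is probably worth the extra
effort in practice.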


-- Gordon D. Pusch   

perl -e '$_ = "gdpusch\@NO.xnet.SPAM.com\n"; s/NO\.//; s/SPAM\.//; print;'



