estimating K and Lambda from an extreme value distribution
Kevin Karplus
karplus at cheep.cse.ucsc.edu
Tue Feb 24 12:18:10 EST 2004
In article <gi4qtl7tg2.fsf at pusch.xnet.com>, Gordon D. Pusch wrote:
> ranjeeva_r at yahoo.com (Ranjeeva) writes:
>
>> I'm trying to fit a set of scores I get from searching a database of
>> 1000 amino acid sequences with a HMM. I want to calculate a p-value
>> for each matching score. My questions are
>>
>> a) How do you estimate the scaling factors K and Lambda to fit my
>> scores (1000) to an extreme value distribution?
>
> The obvious question would be: Why would you bother, since an HMM _directly_
> yields a generative probability estimate? Simply compare the HMM probability
> estimate to that of a "fiducial model," e.g., the "random sequence" model.
>
> However, if you _insist_ on (ab)using extreme-value theory for this problem,
> googling on the exact phrase "extreme value distribution" plus "fitting"
> yields 1,740 hits.

We used to use the log-odds scores, log [P(seq|HMM)/P(seq|null)], in
the SAM HMM program, but found them less useful than we would have
liked, mainly because the null models are so poor. Proteins are not
well modeled as random sequences, so some HMMs have a systematic bias,
because many proteins get partial matches. (For example, many proteins
include amphipathic helices, which will provide uninteresting partial
matches to many HMMs.)

Although we worked on improving the null models in various ways, we
still found it useful to calibrate the resulting HMMs to provide an
interpretable E-value.

Our current scheme does not use extreme-value distributions, but a
rather ad hoc family of symmetric distributions for the
reverse-sequence null model. This approach works well for some
alphabets, but not others (it does very well on amino-acid HMMs, but
terribly on the protein-blocks alphabet). Results and methods are in
a paper we've submitted to Bioinformatics; I hope it comes out this
year.

We're looking into calibration using a different null model that will
require fitting the tails of extreme-value distributions.
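
For the original question of estimating K and Lambda, one simple possibility is a method-of-moments fit of a Gumbel (type I extreme-value) distribution. The sketch below is not what SAM does, and it fits the whole distribution rather than only the tail. The function names and the `search_space` parameter are hypothetical, and K is derived by assuming the Karlin-Altschul form mu = ln(K * search_space) / Lambda. The scores being fitted should come from sequences believed to be unrelated to the model, since true hits in the upper tail will distort the fit.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def fit_gumbel_moments(scores):
    """Method-of-moments Gumbel fit; returns (lam, mu).

    Uses mean = mu + gamma/lam and variance = pi^2 / (6 * lam^2).
    """
    mean = statistics.fmean(scores)
    stdev = statistics.stdev(scores)
    lam = math.pi / (stdev * math.sqrt(6.0))
    mu = mean - EULER_GAMMA / lam
    return lam, mu

def k_from_fit(lam, mu, search_space):
    """K under the assumed relation mu = ln(K * search_space) / lam."""
    return math.exp(lam * mu) / search_space

def p_value(score, lam, mu):
    """P(S >= score) = 1 - exp(-exp(-lam * (score - mu)))."""
    return -math.expm1(-math.exp(-lam * (score - mu)))

def e_value(score, lam, mu, n_sequences):
    """Expected number of null sequences scoring at least `score`."""
    return n_sequences * p_value(score, lam, mu)

def sample_gumbel(n, lam, mu, rng):
    """Synthetic Gumbel scores by inverse-CDF sampling, for sanity checks."""
    samples = []
    for _ in range(n):
        u = rng.uniform(1e-12, 1.0 - 1e-12)  # stay clear of log(0)
        samples.append(mu - math.log(-math.log(u)) / lam)
    return samples
```

A quick sanity check is to draw synthetic scores with `sample_gumbel(20000, 0.5, 10.0, random.Random(0))` and confirm that `fit_gumbel_moments` recovers values close to 0.5 and 10.0. With only 1000 scores the estimates will be noisier, and a maximum-likelihood or tail-only fit may be preferable.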
--
Kevin Karplus karplus at soe.ucsc.edu http://www.soe.ucsc.edu/~karplus
life member (LAB, Adventure Cycling, American Youth Hostels)
Effective Cycling Instructor #218-ck (lapsed)
Professor of Biomolecular Engineering, University of California, Santa Cruz
Undergraduate and Graduate Director, Bioinformatics
Affiliations for identification only.