
rate variation in ML models

Joe Felsenstein joe at evolution.genetics.washington.edu
Thu Oct 2 10:04:55 EST 1997

In article <60u3a5$sap at net.bio.net>, <newsmgr at merrimack.edu> (Andrew Roger) wrote:

>I was wondering how the discrete gamma
>distribution is factored into likelihood calculations.  
>Are particular sites assigned a rate category in advance OR
>does every site have a certain probability of being in one
>of the rate classes (thus the calculations for each rate category
>are done for each site and summed over the whole lot- low
>probability assignments then contribute little to the overall
>probability, and high probability assignments contribute
>a lot to the overall probability).

The latter.  The likelihood for that site for a given tree is the
integral over all possible rates of the product of two terms: the
probability of that rate (taken from the gamma distribution) times the
likelihood that is achieved with that rate.
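
Written out, the statement above reads as follows. The notation is an
assumption chosen for illustration: f(r) is the gamma density of the
rate r, and L_i(T | r) is the likelihood of site i on tree T when that
site evolves at rate r.

```latex
L_i(T) = \int_0^{\infty} f(r)\, L_i(T \mid r)\, dr
```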

>Are all among-site-rate-variation models incorporated into
>the calculations the same way?

For any gamma distribution one does it the same way, but of course
the probabilities assigned to different rates differ.

In practice, the integral is not done (too slow) but instead all
programs evaluate the product at a series of rates and approximate the
integral as a sum.  But you can think of it as if the integral is done.
In any case, one is _not_ assigning a single rate to each site.
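
A minimal sketch of the sum-over-rates idea, under stated assumptions:
the "tree" is just two sequences joined by one branch of length t, the
substitution model is Jukes-Cantor, the gamma distribution has mean 1,
and the integral is approximated by a plain midpoint-rule grid.  All
function names and parameters are hypothetical, and the grid is not the
scheme used by any particular program.

```python
import math

def gamma_pdf(r, alpha):
    """Density at rate r of a gamma distribution with shape alpha and mean 1."""
    return (alpha ** alpha) * r ** (alpha - 1) * math.exp(-alpha * r) / math.gamma(alpha)

def jc69_site_likelihood(same, t, r):
    """Likelihood of one site in a two-sequence alignment under Jukes-Cantor.

    same: True if the two sequences carry the same base at this site.
    t:    branch length separating the sequences (assumed parameter).
    r:    relative rate of this site.
    """
    decay = math.exp(-4.0 * r * t / 3.0)
    p = 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay
    return 0.25 * p  # 0.25 is the equilibrium frequency of the first base

def site_likelihood(same, t, alpha, n_points=2000, r_max=20.0):
    """Approximate the integral over rates as a sum of
    (probability of the rate) x (likelihood at that rate)."""
    step = r_max / n_points
    total = 0.0
    for k in range(n_points):
        r = (k + 0.5) * step  # midpoint of the k-th rate interval
        total += gamma_pdf(r, alpha) * jc69_site_likelihood(same, t, r) * step
    return total

if __name__ == "__main__":
    print(site_likelihood(same=True, t=0.3, alpha=2.0))
    print(site_likelihood(same=False, t=0.3, alpha=2.0))
```

Every site goes through the same sum, so no site is ever given a rate
of its own.  The grid is truncated at r_max, and for shape values below
1 the density is unbounded near zero, so a simple grid like this one
converges slowly there.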

>Finally, does the way that one deals with this problem affect
>whether or not the i.i.d. assumption holds for a dataset?

If there is no autocorrelation of rates among adjacent sites, the
model is still i.i.d. (independent and identically distributed).  But
if, as is allowed in my DNAML and Yang's PAML, there is some autocorrelation
among sites, then the model isn't i.i.d.   This affects, for example,
the validity of bootstrapping.  There is a variant of the bootstrap
called block-bootstrapping that would be necessary in such a case.  It
basically involves sampling N/B blocks of sites, each of length B, where
B is big enough to encompass the correlated sites.
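
A minimal sketch of the block resampling just described.  Two details
are assumptions, since the description above does not fix them: blocks
may start at any position where a full block fits, and when N is not a
multiple of B the resampled alignment has floor(N/B)*B sites.  The
function names are hypothetical.

```python
import random

def block_bootstrap_columns(n_sites, block_len, rng=random):
    """Return column indices for one block-bootstrap replicate.

    Draws n_sites // block_len blocks with replacement; each block is
    block_len consecutive columns of the original alignment.
    """
    if not 1 <= block_len <= n_sites:
        raise ValueError("block_len must be between 1 and n_sites")
    n_blocks = n_sites // block_len
    columns = []
    for _ in range(n_blocks):
        # Assumption: any start position that leaves room for a full block.
        start = rng.randrange(n_sites - block_len + 1)
        columns.extend(range(start, start + block_len))
    return columns

def resample_alignment(alignment, block_len, rng=random):
    """Apply the same resampled columns to every sequence in the alignment."""
    n_sites = len(alignment[0])
    columns = block_bootstrap_columns(n_sites, block_len, rng)
    return ["".join(seq[c] for c in columns) for seq in alignment]

if __name__ == "__main__":
    aln = ["ACGTACGTACGT", "ACGTTCGTACGA", "ACCTACGTACGT"]
    for row in resample_alignment(aln, block_len=4, rng=random.Random(1)):
        print(row)
```

With block_len=1 this reduces to the ordinary bootstrap over single
sites.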

Joe Felsenstein         joe at genetics.washington.edu
 Dept. of Genetics, Univ. of Washington, Box 357360, Seattle, WA 98195-7360 USA
