Joe Felsenstein and Gary Churchill have recently given a concise description
(Mol. Biol. Evol. 13:93-104) of how to compute the likelihood function when
the rates of the individual sites vary and these rates evolve along the
sequence as a Markov chain. If there is no rate variation this function is
simply the product of the likelihoods of all sites, i.e. the product over
all sites of the probabilities of observing each site pattern (Pn is the
probability of seeing the site pattern of site n on the tree; v are the
parameters):

  L(v) = P1(v) * P2(v) * ... * Pn(v)   (over all n sites)
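For concreteness, a minimal sketch of this product in C (the array P and
its contents are hypothetical placeholders; logs are summed only to avoid
the numerical underflow the raw product would suffer on long sequences):

  #include <math.h>

  /* P[i] = probability of the site pattern at site i (assumed precomputed);
     returns log L(v) = sum_i log Pi(v) */
  double log_likelihood(const double *P, int n_sites)
  {
      double logL = 0.0;
      int i;
      for (i = 0; i < n_sites; i++)
          logL += log(P[i]);
      return logL;
  }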
To speed up the calculation one of course needs to compute each distinct
Pn(v) only once. One therefore compresses the data so that each site
pattern occurs only once, and assigns a weight to each pattern:

  L(v) = P1(v)^w(1) * P2(v)^w(2) * ... * Pm(v)^w(m)   (over all m site patterns)

This scheme saves *a lot* of computation.
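In code the compressed form is just a weighted log-sum (again a sketch with
hypothetical arrays; w[j] counts how many sites show pattern j):

  #include <math.h>

  /* P[j] = probability of site pattern j (each evaluated once),
     w[j] = number of sites showing pattern j;
     returns log L(v) = sum_j w(j) * log Pj(v) */
  double log_likelihood_compressed(const double *P, const int *w, int m_patterns)
  {
      double logL = 0.0;
      int j;
      for (j = 0; j < m_patterns; j++)
          logL += w[j] * log(P[j]);
      return logL;
  }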
NOW, IS THERE A WAY TO APPLY A SIMILAR SCHEME IN THE CASE WHERE ONE
CONSIDERS RATE VARIATION??
The likelihood then reads like this (f_{c1} is the prior probability of
rate category c1 at the first site, M_{ci-1,ci} is the Markov transition
probability from the rate category of site i-1 to that of site i, and each
Pi(v) implicitly depends on the rate category ci of site i):

  L(v) = sum_{c1} f_{c1} P1(v) * sum_{c2} M_{c1,c2} P2(v) * ... * sum_{cn} M_{cn-1,cn} Pn(v)   (over all n sites)
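For what it's worth, this nested sum can be evaluated site by site in
O(n * k^2) time (k = number of rate categories) with the standard forward
recursion for hidden Markov models. A sketch, assuming the conditional site
likelihoods P[i][c] -- the probability of site i's pattern given rate
category c -- have already been computed on the tree; f, M and NCAT are
placeholders:

  /* Forward recursion for the rate-HMM likelihood:
       F1(c) = f(c) * P1(v|c)
       Fi(c) = ( sum_{c'} F{i-1}(c') * M(c',c) ) * Pi(v|c)
       L(v)  = sum_c Fn(c)
     No rescaling is done here; a real implementation must rescale F at
     each site to avoid underflow. */
  #define NCAT 4

  double hmm_likelihood(double P[][NCAT], const double f[NCAT],
                        double M[NCAT][NCAT], int n_sites)
  {
      double F[NCAT], Fnew[NCAT], L = 0.0;
      int i, c, d;

      for (c = 0; c < NCAT; c++)          /* site 1: prior times likelihood */
          F[c] = f[c] * P[0][c];

      for (i = 1; i < n_sites; i++) {     /* sites 2..n */
          for (c = 0; c < NCAT; c++) {
              double s = 0.0;
              for (d = 0; d < NCAT; d++)  /* transitions from previous site */
                  s += F[d] * M[d][c];
              Fnew[c] = s * P[i][c];
          }
          for (c = 0; c < NCAT; c++)
              F[c] = Fnew[c];
      }

      for (c = 0; c < NCAT; c++)          /* sum out the last site's category */
          L += F[c];
      return L;
  }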
If one looks into DNAML 4.0 (alpha) one can see that Joe does indeed compress
the input data, but I don't see where to start with the saving ... Maybe he
uses the compressed data only for the single-rate case?
Thanks for any help!
Korbinian Strimmer