ML HMM and site patterns

Joe Felsenstein joe at evolution.genetics.washington.edu
Fri Apr 19 18:20:06 EST 1996

In article <4l7t2q$d2v at lyra.csx.cam.ac.uk>,
Korbinian Strimmer  <strimmer at zi.biologie.uni-muenchen.de> wrote:
>Joe Felsenstein and Gary Churchill have recently given a concise description
>(Mol. Biol. Evol. 13:93-104) of how to compute the likelihood function if
>the rates of each site vary and if these rates are evolving along the
>sequence as a Markov chain.
>To speed up the calculation one has of course only to compute each Pn(v)
>only once. One therefore compresses the data in a way that each site
>pattern occurs only one time, and assigns a weight to this site pattern:
>             w(1)       w(2)            w(m)
> L(v) = P1(v)    * P2(v)     ... * Pm(v)      (over all m site patterns)
>This scheme save *a lot* of computations.

Yes, there is.

>The likelihood then reads like this 
>        -           -                 -
>        \           \                 \
> L(v) = / f   P1(v) / M     P2(v) ... / M       Pn(v)  (over all sites)
>        -  c1       -  c1,c2          -  cn-1,cn
>        c1          c2                cn
>If one looks into DNAML 4.0 (alpha) one can see that Joe does indeed compress
>the input data but I don't see where to start with the saving ... Maybe he
>uses the compressed data only for the single rate case ?

No, I use it in all cases.

Actually we did the same thing in version 3.5c, the current distribution
version (which has the Hidden Markov Model stuff too).  The equation above
can be used, except that of course one has instead of Pi(v) a set of them,
one per rate, so the equation you wrote should have been:

       _              _                    _ 
       \              \                    \
L(v) = / f   P1(v,c1) / M     P2(v,c2) ... / M       Pn(v,cn)
       -  c1          -  c1,c2             -  cn-1,cn
       c1          c2                cn

Now we need for each site a set of P(v,ci).   But the interesting thing
is that if two sites have the same nucleotide pattern, they have the
same P(v,ci)'s.   So we can "compress" the sites, compute the P(v,ci) for
each compressed site (there will be one per possible rate), and

Gary and I didn't explain the compression (aliasing) saving in the paper
but it is there.  For data sets with closely related species, or with
few species and many sites, this is a substantial saving.

Joe Felsenstein         joe at genetics.washington.edu     (IP No.
 Dept. of Genetics, Univ. of Washington, Box 357360, Seattle, WA 98195-7360 USA

More information about the Mol-evol mailing list

Send comments to us at biosci-help [At] net.bio.net