# [Molecular-evolution] Re: F81 distance formular

Joe Felsenstein joe at removethispart.gs.washington.edu
Fri Mar 24 17:15:19 EST 2006

In article <mailman.294.1143220158.16885.mol-evol at net.bio.net>,
Rex Lau <rex.lau at bio.usyd.edu.au> wrote:
>   Could anyone give me the direction or hints on 'how to derive the F81
>model distance formula' from its probability formula ?? The F81 distance is
>given as follows :
>
>D = -(1-([P_A]^2 + [P_C]^2 + [P_G]^2 + [P_T]^2))ln(1-(P/([P_A]^2 + [P_C]^2 +
>[P_G]^2 + [P_T]^2)).
>
>  The probability formula for F81 is given as:
>
>  P(ij) =  P_j + (1-P_j)*Exp[-at]    for all i == j
>           P_j*(1-Exp[-at])          for all i != J
>
>where P_j is the equilibrium value for the frequency of each nucleotide.
>      Exp is the exponential function.
>      at  is the expected rate of change per unit time.
>
>  I use the idea given by Li's book [Molecular Evolution - The method to
>derive Juke-Cantor model distance based on sequence similiarity (p.69 and
>p.80] for F81 model, but it doesn't come out that what i expected.

You might think I would know, but when I defined the F81 model in 1981
I never actually derived a distance formula for it -- I was using the model
in maximum likelihood analysis of phylogenies, for which distances aren't
involved.

So let's see.  If there is to be an F81 distance, it should be the
maximum likelihood estimate of divergence "time" ( = branch length) for
the F81 model, for two species.  If there are  n_{ij} sites that have
base i in the first sequence and base j in the second sequence, the
likelihood is

L  =  \prod_{ij}  P(ij)^n_{ij}

looking at the transition probabilities:
>  P(ij) =  P_j + (1-P_j)*Exp[-at]    for all i == j
>           P_j*(1-Exp[-at])          for all i != J

or in their original, and equivalent, form (Felsenstein, 1981, eq. 7)

P(ij) =  Exp[-at] + (1-Exp[-at]) p_j    for  j = i
P(ij) =          (1 - Exp[-at]) p_j     for all  j != i

we can take the log of the likelihood, and plug in the formulas for
P(ij):

ln L  =  \sum_{ij}  n_{ij} ln (\delta_{ij} Exp[-at] + (1-Exp[-at])p_j)

where \delta_{ij} is the familiar Kronecker Delta (1 if i=j, otherwise 0),
taking a = 1 and taking the derivative of this with respect to  t,

\partial ln L                               (-\delta_{ij} + p_j)
-------------  =  \sum_{ij}  n_{ij}  -------------------------------------
\partial t                         \delta_{ij} Exp[-t] + (1-Exp[-t])p_j

all we need to do is equate that to 0 and solve for t.  (There is the
issue of whether we are given the p_j or are estimating them, in which case
we have to take derivatives with respect to them too).  I think the solution
is possible when all the p_i are equal, but I don't see what it is more
generally.

Has someone somewhere claimed the simple formula you gave above?  Can it
be derived?  I don't think I ever gave it.

----
Joe Felsenstein         joe at removethispart.gs.washington.edu
Department of Genome Sciences and Department of Biology,
University of Washington, Box 357730, Seattle, WA 98195-7730 USA