> We have been told by the likes of David Penny and others that taking the
> number of invariable sites into consideration when calculating pairwise
> distances and performing Max. Likelihood calculations is important. The
> suggestion has been to use maximum likelihood to estimate the number of
> positions that are likely to be invariable (as opposed to simply
> constant). We should then remove the requisite number of invariable
> sites such that the dataset now has the EXPECTED number of constant
> sites.
OK, but the problem is knowing which sites are the invariable ones.
The likelihood method (as used e.g. in the HKY 1985 paper) uses a
parameter f corresponding to the probability that a given site is
invariable (= fraction of invariable sites among all sites).
In this way all positions are examined and the problem of
deciding which constant sites are invariable is circumvented.
The value of f that maximizes this likelihood function is the
ML estimate of f, and it is smaller than or equal to the fraction
of constant sites.
In theory, if you knew which sites are invariable you could drop
them and you'd get f = 0.0. In practice you don't know, and you
have to live with the complete alignment and f > 0.0.
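
To make the role of f concrete, here is a minimal sketch of the
mixture likelihood in Python. It is an illustration only, not the
code of any published program. Assumptions: the per-site likelihoods
under the "variable" model (lik_var) and the base frequencies are
made-up numbers that are held fixed, whereas a real ML program would
compute them from the tree and the substitution model and optimize
them together with f. The function names are hypothetical.

```python
import math

def log_likelihood(f, sites):
    """Log-likelihood of the alignment for a given fraction f of invariable sites.

    sites: list of (is_constant, base_freq, lik_var) tuples, where
      is_constant -- True if all sequences show the same base at the site
      base_freq   -- frequency of that base (only used for constant sites)
      lik_var     -- site likelihood under the variable model (assumed given)
    """
    total = 0.0
    for is_constant, base_freq, lik_var in sites:
        # An invariable site can only produce a constant pattern.
        inv_part = f * base_freq if is_constant else 0.0
        total += math.log(inv_part + (1.0 - f) * lik_var)
    return total

def estimate_f(sites, tol=1e-8):
    """Golden-section search for the f in [0, 1) that maximizes the likelihood.

    With fixed per-site likelihoods the log-likelihood is concave in f,
    so a one-dimensional search is sufficient.
    """
    lo, hi = 0.0, 1.0 - 1e-9
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if log_likelihood(a, sites) < log_likelihood(b, sites):
            lo = a
        else:
            hi = b
    return (lo + hi) / 2.0

# Toy data (made up): 6 constant sites and 4 variable sites.
sites = [(True, 0.25, 0.05)] * 6 + [(False, 0.0, 0.001)] * 4
frac_constant = sum(1 for s in sites if s[0]) / len(sites)
f_hat = estimate_f(sites)
print(f_hat, frac_constant)
```

For these toy numbers the estimate comes out close to 0.5, below the
fraction of constant sites (0.6), which is the inequality mentioned
above.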
> So far so good. We just instruct the program to multiply the entries
> aa, cc, gg, tt (in the following table) by the
> proportion-of-constant-sites-assumed-to-be-invariable (am I correct?).
I don't understand what these entries are, but the parameter usually
used is the fraction of invariable sites among all sites
(what you have is the conditional probability of being invariable
given that a site is constant).
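
The two quantities can be converted into each other. Since every
invariable site is necessarily constant, P(invariable | constant)
= f / P(constant). A small Python sketch, assuming the observed
fraction of constant sites is used as the estimate of P(constant);
the function names are hypothetical:

```python
def prob_invariable_given_constant(f, frac_constant):
    """P(invariable | constant) from f = fraction of invariable sites among all sites."""
    if frac_constant <= 0.0:
        raise ValueError("alignment has no constant sites")
    return f / frac_constant

def fraction_invariable(p_inv_given_const, frac_constant):
    """Inverse conversion: f from the conditional probability."""
    return p_inv_given_const * frac_constant

# Made-up example: f = 0.5 and 60% constant sites.
p_cond = prob_invariable_given_constant(0.5, 0.6)
f_back = fraction_invariable(p_cond, 0.6)
print(p_cond, f_back)
```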
> What about when you are bootstrapping the dataset?
I guess you are talking about bootstrapping ML trees (your introduction!).
For the ML you estimate your f parameter once with the complete
data set. Then you simply do bootstrapping with the whole data
set and with a fixed f. Don't remove sites unless you know for
sure that they are invariable (if f = fraction of constant sites
you can remove all constant positions, of course).
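
The resampling step might look like the following Python sketch.
It only illustrates the bookkeeping: columns are drawn with
replacement from the complete alignment, and f is estimated once
beforehand and passed unchanged to every replicate. The alignment,
the value of F_FIXED and the function name are made up for the
example; the tree reconstruction itself is not shown.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement.

    alignment: list of equal-length sequences (strings).
    Returns a new alignment with the same number of sites.
    """
    n_sites = len(alignment[0])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in cols) for seq in alignment]

# Made-up toy alignment; no sites are removed before resampling.
alignment = ["ACGTACGTAA",
             "ACGTACGAAA",
             "ACGTTCGAAA"]

F_FIXED = 0.5  # hypothetical value, estimated once from the complete data set

rng = random.Random(42)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]

# Each replicate would then be analysed with the same fixed f, e.g.
# (replicate, F_FIXED) handed to whatever ML tree program is used.
jobs = [(rep, F_FIXED) for rep in replicates]
```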
If you are talking about NJ trees, then you should make sure that
you use maximum likelihood distances in which f is incorporated.
Finally, I'd like to mention the practical problem: though the
ML method with invariable sites is old (1985), I don't know
how to do it in practice with any published program (please
tell me if you know of such an ML estimation program!).
In the ML program that Arndt and I are developing we will
incorporate such an estimator, but as there are also other
features in the works it'll take some time until we give
it away. (If you like, there is a preliminary test version that
can estimate f, but don't rely too much on it because it is still
very buggy:
http://www.zi.biologie.uni-muenchen.de/~strimmer/future.html )
Hope this helps,
Korbinian