Hi,
I've posed this question before without reply, so here it is again.
We have been told by the likes of David Penny and others that taking the
number of invariable sites into consideration when calculating pairwise
distances and performing maximum likelihood calculations is important. The
suggestion has been to use maximum likelihood to estimate the number of
positions that are likely to be invariable (as opposed to simply
constant). We should then remove the requisite number of invariable
sites such that the dataset now has the EXPECTED number of constant
sites.
So far so good. We just instruct the program to multiply the entries
aa, cc, gg, tt (in the following table) by the
proportion-of-constant-sites-assumed-to-be-invariable (am I correct?).
     A   C   G   T
A   aa  ac  ag  at
C   ca  cc  cg  ct
G   ga  gc  gg  gt
T   ta  tc  tg  tt
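To make the step above concrete, here is a minimal Python sketch of the
pairwise table and the diagonal scaling. It is only an illustration, not
what any particular program does: `pair_table`, `scale_diagonal` and the
value passed as `factor` are made up for the example, and the two short
sequences are toy data.

```python
from collections import Counter

BASES = "ACGT"

def pair_table(seq_x, seq_y):
    """Count site patterns for one pair of aligned sequences.

    Returns a dict keyed by (base_in_x, base_in_y). Sites containing
    anything other than A, C, G or T (gaps, ambiguity codes) are ignored.
    """
    counts = Counter(zip(seq_x.upper(), seq_y.upper()))
    return {(a, b): float(counts.get((a, b), 0)) for a in BASES for b in BASES}

def scale_diagonal(table, factor):
    """Return a copy with the aa, cc, gg, tt entries multiplied by `factor`."""
    return {(a, b): n * factor if a == b else n for (a, b), n in table.items()}

# Toy sequences, invented for illustration only.
table = pair_table("AACCGGTTAC", "AACCGGTAAT")
# 0.5 is an arbitrary placeholder, not an estimated proportion.
adjusted = scale_diagonal(table, 0.5)
```

Only the four diagonal cells change; the off-diagonal counts (the
observed differences) are left alone.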
My question:
What about when you are bootstrapping the dataset?
In any bootstrap replicate there will probably be more (or fewer)
constant sites than in the original dataset. The net result is that some
replicates have so few constant sites that the distance between some
pairs of sequences is incalculable.
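A rough Python sketch of why this happens: resampling columns with
replacement changes the number of constant sites from one replicate to
the next. The alignment is toy data and the helper names are made up for
illustration.

```python
import random

def constant_site_count(alignment):
    """Number of columns in which every sequence has the same character."""
    return sum(1 for column in zip(*alignment) if len(set(column)) == 1)

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment]

# Toy alignment, invented for illustration only.
alignment = [
    "ACGTACGTAC",
    "ACGTACGAAT",
    "ACGAACGTAT",
]
rng = random.Random(1)
counts = [constant_site_count(bootstrap_replicate(alignment, rng))
          for _ in range(100)]
# Constant sites in the original, then the range seen across replicates.
print(constant_site_count(alignment), min(counts), max(counts))
```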
I have found empirically that you get fewer incalculable distances
during bootstrapping if you first remove the requisite number of sites
and then perform regular bootstrapping (at least for one of my datasets).
Is this a more valid approach than re-adjusting the constant sites during
bootstrapping? Or is this introducing some kind of bias?
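The two orderings can be sketched in Python as follows. This is an
illustration under stated assumptions rather than what any program
actually does: the Jukes-Cantor distance stands in for whatever distance
is really used, `k` is a fixed number of sites to discount,
"re-adjusting during bootstrapping" is read as discounting `k` identical
sites from every pair in every replicate, and the alignment is toy data.
All function names are hypothetical.

```python
import math
import random
from itertools import combinations

def jc_distance(seq_x, seq_y, n_removed=0):
    """Jukes-Cantor distance after discounting `n_removed` identical sites.

    Returns None when the distance is incalculable (no sites left, or the
    proportion of differing sites is 3/4 or more).
    """
    n_sites = len(seq_x) - n_removed
    if n_sites <= 0:
        return None
    p = sum(1 for a, b in zip(seq_x, seq_y) if a != b) / n_sites
    if p >= 0.75:
        return None
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def resample(alignment, rng):
    """One bootstrap replicate: columns drawn with replacement."""
    n = len(alignment[0])
    picks = [rng.randrange(n) for _ in range(n)]
    return ["".join(seq[i] for i in picks) for seq in alignment]

def drop_constant_columns(alignment, k):
    """Remove the first k constant columns from the alignment."""
    kept, dropped = [], 0
    for column in zip(*alignment):
        if dropped < k and len(set(column)) == 1:
            dropped += 1
        else:
            kept.append(column)
    if not kept:
        return ["" for _ in alignment]
    return ["".join(chars) for chars in zip(*kept)]

def count_incalculable(alignment, n_removed):
    """Number of sequence pairs whose distance cannot be computed."""
    return sum(jc_distance(x, y, n_removed) is None
               for x, y in combinations(alignment, 2))

def compare_schemes(alignment, k, n_reps, seed=0):
    """Count incalculable distances under each ordering."""
    rng_a, rng_b = random.Random(seed), random.Random(seed)
    reduced = drop_constant_columns(alignment, k)
    fails_a = fails_b = 0
    for _ in range(n_reps):
        # Scheme A: sites removed once, then ordinary bootstrapping.
        fails_a += count_incalculable(resample(reduced, rng_a), 0)
        # Scheme B: bootstrap the full data, then discount k sites per replicate.
        fails_b += count_incalculable(resample(alignment, rng_b), k)
    return fails_a, fails_b

# Toy alignment, invented for illustration only.
alignment = [
    "ACGTACGTACGT",
    "ACGTACGAACTT",
    "ACGAACGTATGT",
]
fails_first, fails_during = compare_schemes(alignment, k=4, n_reps=100)
```

Counts like these only show how often each scheme fails on a given
dataset; they say nothing about whether either scheme is biased.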
Hope I have been clear (and hope I get an answer).
James