>> Not sure what it is that you say is not being considered. DNAML takes
> as the base frequencies (default ones -- the user can put in their own
> values if they want, too) the average base frequencies over the sequences.
> Of course this weights different sequences as if they were independent,
> which they aren't. Optimally one would instead estimate them by
> maximum likelihood. I think PAUP* will be able to do that. But the
> results will, I think, rarely be noticeably better that way.
>
OK, I'll try to be more precise (sorry to those of you with a slight
aversion to maths ;-)
Let's focus exclusively on one site in a sequence alignment. This
site shows a certain pattern of nucleotides (amino acids).
For the moment let us assume that this site is variable. Then we
can compute a probability P of observing this pattern, given a tree and
a model of sequence evolution M. M is usually a simple Markov model
with stationary frequencies Pi[x], where x is a specific nucleotide
(amino acid). If all sites in a sequence alignment are variable then
simply counting the frequencies of each nucleotide (amino acid) in the
data set gives a good (ML) estimate of Pi[x]. So far so good.
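
As a rough sketch of that counting step (Python; the function name
empirical_frequencies and the toy alignment are made up for
illustration and are not taken from DNAML or PAUP*):

```python
from collections import Counter

def empirical_frequencies(alignment, alphabet="ACGT"):
    """Frequency of each character, pooled over all sequences and sites."""
    counts = Counter()
    for seq in alignment:
        # Ignore gaps and ambiguity codes: only count alphabet characters.
        counts.update(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no characters from the alphabet found")
    return {x: counts[x] / total for x in alphabet}

# Toy alignment in which every column is variable (hypothetical data).
alignment = ["ACGT", "CAGA", "ACTA"]
pi = empirical_frequencies(alignment)
```

Like the averaging described in the quoted text, this treats the
sequences as if they were independent.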
Let us now assume that the site examined is invariable. Then
the probability K of seeing the pattern is

        | 0     if the site shows a non-constant pattern
    K = |
        | K[x]  if the pattern consists only of nucleotide (aa) x

where K[x] is the frequency of nucleotide (amino acid) x at
invariable sites. If the prior probability that a given site is
invariable is f, then the total likelihood is

    L = f K + (1-f) P
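
To make the formula concrete, here is a minimal sketch in Python. It
assumes that P has already been computed elsewhere (by whatever tree
likelihood program you use) and is simply passed in as a number; the
function name site_likelihood and all numerical values are invented
for illustration.

```python
def site_likelihood(pattern, f, K, P):
    """Return L = f*K + (1-f)*P for one alignment column.

    pattern : string with one character per sequence, e.g. "AAA"
    f       : prior probability that the site is invariable
    K       : dict mapping x to K[x], the frequencies at invariable sites
    P       : probability of the pattern under the tree and model M
              (assumed to be computed elsewhere)
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    if len(set(pattern)) == 1:
        k_term = K[pattern[0]]   # constant pattern made of x: K = K[x]
    else:
        k_term = 0.0             # non-constant pattern: K = 0
    return f * k_term + (1.0 - f) * P

# Made-up numbers, for illustration only.
K = {"A": 0.4, "C": 0.1, "G": 0.3, "T": 0.2}
L_const = site_likelihood("AAA", f=0.3, K=K, P=0.05)   # 0.3*0.4 + 0.7*0.05
L_var = site_likelihood("ACA", f=0.3, K=K, P=0.01)     # 0.7*0.01
```

For a non-constant pattern only the (1-f) P term survives, so K[x]
matters only for the constant columns.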
In the literature and in the implementations that I know, one
does not distinguish between K[x] and Pi[x], though the two have
completely different meanings (and probably different values).
If there are no invariable sites then Pi[x] = actual frequencies
in the data and K[x] = 0, and the other way round if all sites
are invariable. I agree that using the empirical Pi[x] for both
Pi[x] and K[x] probably does not have a critical influence on the
final result, but if there is a strong bias towards sites being
invariable then there might be a difference. I think a good way
might be to count two different sets of base compositions (constant
vs. non-constant columns) to get estimates of the base compositions
of interest (invariable vs. variable sites).
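
The two-way count could look roughly like this (a Python sketch; the
function name split_compositions and the toy alignment are
hypothetical). Keep in mind that an observed constant column may also
be a variable site that simply did not change, so the "constant"
composition is only a rough proxy for K[x].

```python
from collections import Counter

def split_compositions(alignment, alphabet="ACGT"):
    """Base composition of constant and of non-constant columns.

    Assumes all sequences have the same length. A column containing a
    gap or ambiguity code next to other characters counts as non-constant.
    """
    constant, variable = Counter(), Counter()
    for column in zip(*alignment):
        column = [c.upper() for c in column]
        target = constant if len(set(column)) == 1 else variable
        target.update(c for c in column if c in alphabet)

    def normalise(counts):
        total = sum(counts.values())
        # None signals that no column of this kind was observed.
        return {x: counts[x] / total for x in alphabet} if total else None

    return normalise(constant), normalise(variable)

# Toy alignment: columns 1-2 are constant, columns 3-4 are not.
alignment = ["GGAC", "GGCT", "GGAT"]
k_est, pi_est = split_compositions(alignment)
```

Here k_est is all G while pi_est contains no G at all, which is the
kind of difference that pooling everything into one Pi[x] would hide.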
Korbinian