Felsenstein's maximum likelihood phylogeny inference programs in PHYLIP
(DNAML and DNAMLK) require a 'base pool' to model evolution. By default,
the programs use base frequencies estimated from the actual
composition of the sequences used to reconstruct the tree. But it is my
impression that this 'base pool' is intended to reflect the pool of
available nucleotides in the cell, from which the wrong nucleotide may
be occasionally chosen and incorporated into the DNA. If a sequence is
under selection for the conservation of a
peptide product, then its composition will not necessarily reflect the
base pool. A
neutral region of sequence that has reached an equilibrium determined by
the rates of misincorporation of different bases should reflect that pool.
But I don't think I have such a region. Perhaps 3rd position sites in
an ORF would approximate this? Or would it make sense to use DNAML
iteratively to find the base pool that yields the highest likelihood (for
a given data set and tree)? What if neutral regions seem to drift to
very high AT content? Does this mean that there are almost no G or C
nucleotides in the base pool? Or does this phenomenon have something to
do with the 'ease of misincorporation' differing between nucleotides (and
could that be absorbed into the base pool model)? Am I putting too fine
a point on a model that was only meant to be a reasonable approximation
in the first place?
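
To make two of these ideas concrete, here is a minimal sketch in Python. It
assumes the sequences are aligned and in frame, so that every third site
starting at index 2 is a third codon position. The function names, the toy
sequences, and the two-sequence likelihood are all invented for illustration.
The likelihood is an F81-style stand-in in which a replacement base is drawn
from the pool; it is not DNAML's actual calculation, which evaluates the whole
tree.

```python
import math
from collections import Counter

BASES = "ACGT"

def base_frequencies(seqs, start=0, step=1):
    """Empirical A, C, G, T frequencies over the sampled sites.

    seqs  : aligned, in-frame DNA strings (hypothetical input)
    start : 0-based index of the first site to sample
    step  : sample every step-th site; start=2, step=3 picks
            third codon positions
    Gaps and ambiguity codes are skipped.
    """
    counts = Counter()
    for seq in seqs:
        for base in seq.upper()[start::step]:
            if base in BASES:
                counts[base] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous bases at the sampled sites")
    return {b: counts[b] / total for b in BASES}

def f81_pair_loglik(seq1, seq2, pool, ut):
    """Log-likelihood of two aligned sequences given a base pool.

    Assumed model: P(i -> j) = exp(-ut) * [i == j]
                               + (1 - exp(-ut)) * pool[j]
    Every pool frequency must be > 0 for bases that occur in the data.
    """
    keep = math.exp(-ut)  # probability that no replacement event occurred
    loglik = 0.0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in BASES or b not in BASES:
            continue  # skip gaps and ambiguity codes
        p_change = (1.0 - keep) * pool[b] + (keep if a == b else 0.0)
        loglik += math.log(pool[a] * p_change)
    return loglik

# Toy in-frame sequences, made up for this example.
orf = ["ATGGCTAAATTA", "ATGGCAAAGTTA"]

all_sites = base_frequencies(orf)
third_sites = base_frequencies(orf, start=2, step=3)
print("all sites:      ", all_sites)
print("third positions:", third_sites)

# Compare two candidate pools at a fixed, arbitrary ut.
uniform = {b: 0.25 for b in BASES}
ll_empirical = f81_pair_loglik(orf[0], orf[1], all_sites, ut=0.2)
ll_uniform = f81_pair_loglik(orf[0], orf[1], uniform, ut=0.2)
print("log-likelihood, empirical pool:", ll_empirical)
print("log-likelihood, uniform pool:  ", ll_uniform)
```

In practice the third-position frequencies would be entered into DNAML by hand
in place of the empirical ones, and the pool comparison would be repeated with
DNAML's own likelihood on the full tree rather than with this two-sequence
toy. Note that in this assumed model the pool is also the equilibrium
composition, so a pool estimated this way would absorb any misincorporation
bias along with nucleotide availability.
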
I'm sure that people more clever than I have thought about this.
Observations, clarifications, references, and especially answers will be
appreciated. Argument is welcome as well.
University of Utah