Making alignments

Guy A. Hoelzer hoelzer at
Mon Jan 26 13:13:36 EST 1998

In article <6adkcu$sk7 at>, newsmgr at wrote:

> Guy A. Hoelzer wrote:

> > I think the question of robustness to model assumptions is still a wide
> > open question.  At this point, everyone I talk to seems to think that their
> > approach (MP vs. ML) is more robust, but I am not aware of any direct
> > comparisons.  

> Actually there are tons of comparisons in the molecular evolution
> literature.  Just look for papers by Hillis, Huelsenbeck, Kuhner and
> Felsenstein, Yang, Goldman.  Many of their recent papers deal with how
> well the different methods perform when simulations are performed under
> different sets of models (that is, performance under conditions where the
> model of simulation is the same as or different from the model used for ML
> or distance correction).

I am familiar with many papers from these authors, but I do not know of
any that directly compare the robustness of different tree-estimating
algorithms to deviations from model assumptions.  If you know of a
specific reference that does this, I would be quite interested.

> I think all methods, including ML, will give the correct answer when
> "clean hierarchy" is present.

I agree in general, but it is certainly possible that filtering "clean
hierarchy" through a complex model can distort and erode that hierarchy.
In practice, this could happen to a user of ML.  The "distorting" model
would be intended to reveal hierarchy that might have been distorted by
the evolutionary process, but that strategy can backfire when the original
pattern is clear.

> The question should therefore be which
> methods are likely to outperform other methods when homoplasy starts to
> decay the hierarchy.  The problem with MP and simpler models of evolution
> (used in distance and ML analysis, e.g. Jukes & Cantor) is that
> particular biases in the way homoplasy occurs will be "overlooked".  For
> instance, if there is heterogeneity in rates among sites and the model used
> in the method ignores this process, then there will be systematic
> underestimation of branch lengths during the ML calculations.  This in
> turn can lead to the familiar long-branch attraction phenomenon (if
> generalized rates in different lineages are different).  Unweighted MP
> implicitly assumes that all sites are evolving in more or less the same
> way, so similar problems can arise.  Simple distance corrections will
> underestimate pairwise distances under conditions of rate
> heterogeneity.  However, if rate heterogeneity is built into the
> model of evolution, then parameter estimates will not necessarily be
> biased and the chances of getting the correct tree increase.
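
The underestimation of pairwise distances described in the quoted paragraph
can be checked numerically.  What follows is only a minimal sketch in Python,
not anything from the original posts.  It assumes the Jukes-Cantor model with
gamma-distributed rates among sites; the shape parameter alpha = 0.5 and the
true distance of 1.0 substitutions per site are arbitrary illustrative
choices, and the function names are hypothetical.

```python
import math

def expected_p_gamma(d, alpha):
    """Expected proportion of differing sites under Jukes-Cantor with
    gamma-distributed rates (shape alpha, mean rate 1) at true distance d."""
    return 0.75 * (1.0 - (1.0 + 4.0 * d / (3.0 * alpha)) ** (-alpha))

def jc_distance(p):
    """Plain Jukes-Cantor correction, which assumes equal rates at all sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc_gamma_distance(p, alpha):
    """Jukes-Cantor correction that accounts for gamma rate heterogeneity."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)

# Illustrative (assumed) values, not taken from any data set.
true_d = 1.0   # true distance in substitutions per site
alpha = 0.5    # strong rate heterogeneity among sites

p = expected_p_gamma(true_d, alpha)        # what we would expect to observe
naive = jc_distance(p)                     # model ignores rate heterogeneity
corrected = jc_gamma_distance(p, alpha)    # model includes rate heterogeneity

print(f"observed p = {p:.4f}")
print(f"equal-rates estimate = {naive:.4f}")
print(f"gamma estimate = {corrected:.4f}")
```

With these assumed values the equal-rates correction recovers less than half
of the true distance, while the gamma correction recovers it exactly.  That
exact recovery depends on using the true alpha, which in practice would itself
have to be estimated.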

Nicely described!  I agree with all of this, although I would add the
caveat that we do not know which model is the right one to use in
practice.  IMHO, the tests recommended for use with ML analysis, which are
designed to guide one to the best model, are flawed.

>         In this context, "inductive" procedures (by which I assume you mean
> estimating parameters for rate heterogeneity and estimating
> branch lengths etc.) potentially allow biases in the way homoplasy
> occurs to be accounted for.  Signal can then be detected over this
> "noise" which obscures the hierarchy you describe.

Assuming mutational saturation has not gone too far and you know the true
evolutionary model to use for the data, I agree.  Unfortunately, we never
know this information with certainty, and cannot measure the degree of
certainty with currently available procedures due to the inductive nature
of ML.  By induction in this context, I mean that the choice of model and
parameter values is a statement about the natural world, but they are only
tested against the original data.  Then, assuming the model, we try to make
statements about the world outside of the data.  Without comparing the data
to something derived from the natural world, we cannot make probabilistic
statements about the analytical results.

>         I do agree with you that ML should not be portrayed as a "cure all". 
> The problem with using ML is that one must estimate many, many parameters
> from what is usually a small amount of data. The more complex the
> model used, the more parameters must be estimated.  This surely
> increases the "random error" in the phylogeny estimation, which, I
> expect, would decrease the efficiency of the method. I'm not sure what
> simulations have shown in this regard -- but I'm guessing that in cases
> where the overall divergence between sequences is high and there are few
> alignment positions, methods such as weighted parsimony or some simple
> distance method will outperform ML methods.  

> Does anyone know if this is true?

This is an interesting prediction, and I, too, would like to hear the
answer if it has been tested.

Guy Hoelzer                              e-mail:  hoelzer at
Department of Biology                    phone:   702-784-4860
University of Nevada Reno                fax:     702-784-1302
Reno, NV  89557

More information about the Mol-evol mailing list