"Logic of Cladistics"
Bruce Rannala
rannala at MINERVA.CIS.YALE.EDU
Fri Jun 10 12:48:15 EST 1994
In the interests of polite conversation, I would generally avoid this topic,
along with religion and politics. However, I think Felsenstein is correct in
his assertion that the question of the logic underlying so-called "cladistic
approaches" to phylogeny reconstruction has been largely ignored by the
status quo mostly because of the unsettling outcomes (or lack of outcomes).
I offer a few of my own observations on these questions, and my apologies
for any nihilistic tendencies. The "classical" method of maximum-likelihood,
due to R.A. Fisher, is generally applicable in cases where the family of
possible distributions is known, apart from a finite number of unknown real
parameters. Often, there may exist a unique set of parameter values that is
most likely (i.e., maximizes the log-likelihood function), although this is
by no means always the case. So the key feature here is that the family of
"possible" distributions on the sample space is known (there are, of course,
all sorts of other possible complications involving measure-theoretic
considerations for discrete and continuous random variables which we
biologists generally ignore).
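
In symbols, the setup just described can be written as follows. The notation is assumed here for illustration and is not from the original discussion: f is the known family of densities, theta the finite vector of unknown real parameters, and the observations x_1, ..., x_n are taken to be independent.

```latex
L(\theta \mid x_1,\dots,x_n) = \prod_{i=1}^{n} f(x_i;\theta),
\qquad
\hat{\theta} = \arg\max_{\theta \in \Theta} \, \log L(\theta \mid x_1,\dots,x_n),
\qquad
\Theta \subseteq \mathbb{R}^{k}.
```

The maximizer need not exist or be unique, which is the caveat raised above.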
A practical outcome of this constraint is that some "model" of evolution is
generally needed in order to specify the form of the distribution of character
states over the sample space, and hence to estimate branch lengths or other parameters
of interest. An example of a possible model of gene frequency change is the
stochastic process known as "Brownian motion."
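
As a rough illustration of how such a model yields a likelihood, here is a minimal sketch. It assumes (none of this is from the post) that a suitably transformed gene frequency at each locus diffuses independently in two populations, so that their difference after a total path length t is Normal(0, sigma2 * t), that loci are independent, and that sigma2 is known. The function names are hypothetical.

```python
import random

def simulate_bm_differences(t, sigma2, n_loci, rng):
    """Draw per-locus differences between two populations.

    Assumption: each difference is Normal(0, sigma2 * t), where t is the
    total path length separating the two populations.
    """
    sd = (sigma2 * t) ** 0.5
    return [rng.gauss(0.0, sd) for _ in range(n_loci)]

def ml_divergence_time(differences, sigma2):
    """Maximum-likelihood estimate of t under the assumed normal model.

    The log-likelihood is maximized where sigma2 * t equals the mean
    squared difference, so t_hat = sum(d^2) / (n * sigma2).
    """
    n = len(differences)
    return sum(d * d for d in differences) / (n * sigma2)

rng = random.Random(1994)  # fixed seed so the run is repeatable
diffs = simulate_bm_differences(t=2.0, sigma2=0.5, n_loci=5000, rng=rng)
t_hat = ml_divergence_time(diffs, sigma2=0.5)
print(round(t_hat, 3))
```

With many loci the estimate should land close to the simulated value of t; how well the normal model fits real gene frequency data is, of course, exactly the point in dispute.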
Many cladists of the Elliott Sober school get rather upset over
simple models of evolution, such as "Brownian motion." In many cases, this
is probably justified. However, the alternative they espouse is to adopt a
"statistic," parsimony-based minimum branch length, to decide among trees
without understanding any of the properties of this statistic. At this
point, Occam's Probative is hurried in to save face. So the important
question is: how well-behaved is the cladist's statistic, really? Under what
sorts of evolutionary models does parsimony work well, or not so well? A
number of authors, including Felsenstein, Hillis, Nei and Penny (to name only
a few), have tried to answer this question by evaluating the efficiency of
parsimony methods under several different models of the evolutionary
process, and also empirically using a known phylogeny (I believe it was for
bacteria). The bottom line? Realistic evolutionary models tend to be
multivariate stochastic processes that defy analytical solution, and closed
form expressions for the character state distributions (i.e., distribution
functions or probability density functions) are generally unavailable.
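
For concreteness, the parsimony "statistic" for a given tree can be computed as in the sketch below. It assumes Fitch's algorithm for unordered characters on a binary tree, which is one common choice and not necessarily the variant any particular cladist has in mind. The tree encoding (nested 2-tuples of leaf names) and the toy alignment are invented for illustration.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character (Fitch's algorithm).

    tree: nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D")).
    states: dict mapping leaf name -> observed character state.
    """
    def visit(node):
        if isinstance(node, str):            # leaf: its state set is the observation
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:                           # children can agree: no extra change
            return common, left_cost + right_cost
        # children cannot agree: take the union and count one change
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]

def parsimony_length(tree, alignment):
    """Sum of Fitch scores over all sites; alignment maps leaf name -> sequence."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )

# Toy data (hypothetical), scored on the three possible four-taxon topologies.
alignment = {"A": "AAGTA", "B": "AAGCA", "C": "GAGTC", "D": "GACCC"}
trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
for name, tree in trees.items():
    print(name, parsimony_length(tree, alignment))
```

The tree with the smallest length is the one parsimony prefers. Nothing in the calculation refers to a model of evolution, which is why the statistic's behavior has to be studied separately.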
It seems to me that there are three avenues from here: (1)
increasingly complex computer simulations, inspired by research on
evolutionary mechanisms, that attempt to evaluate the statistical properties
of various phylogenetic estimators; (2) tracking real (perhaps
artificially-accelerated) evolution in those organisms for which this is
possible (mainly bacteria and viruses) and evaluating the statistical
properties of the methods empirically (this would be tedious and allow for
few generalizations to other species; each species might require a different
estimator due to its "different" evolution); (3) developing phylogeny
estimation methods with statistical properties that do not depend on any
particular family of distributions over the sample space. For example,
least-squares estimators require no knowledge of the form of the
distribution of the error vector, apart from the mean and variance matrix.
Recent methods of "partial-likelihood" might also be helpful here (please
don't ask me to explain PL methods, as I am no expert; see your local stats
professor).
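
Avenue (1) can be sketched in a few lines. The example below is a toy, not a realistic study: it assumes a two-state symmetric model of character change, a true tree ((A,B),(C,D)), and per-branch change probabilities p_long (branches to A and C) and p_short (all other branches). All names and parameter values are hypothetical. For four taxa and binary characters, parsimony simply picks the split supported by the most informative sites.

```python
import random

def evolve(state, p_change, rng):
    """Flip a binary state with probability p_change along one branch."""
    return 1 - state if rng.random() < p_change else state

def simulate_site(p_long, p_short, rng):
    """One binary site on the true tree ((A,B),(C,D)); A and C get long branches."""
    u = rng.randrange(2)              # state at the ancestor of A and B
    v = evolve(u, p_short, rng)       # internal branch to the ancestor of C and D
    a = evolve(u, p_long, rng)
    b = evolve(u, p_short, rng)
    c = evolve(v, p_long, rng)
    d = evolve(v, p_short, rng)
    return a, b, c, d

def parsimony_choice(sites):
    """Return the split with the most supporting sites, or None on a tie."""
    support = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for a, b, c, d in sites:
        if a == b and c == d and a != c:
            support["AB|CD"] += 1
        elif a == c and b == d and a != b:
            support["AC|BD"] += 1
        elif a == d and b == c and a != b:
            support["AD|BC"] += 1
    best = max(support.values())
    winners = [split for split, count in support.items() if count == best]
    return winners[0] if len(winners) == 1 else None

def recovery_rate(p_long, p_short, n_sites, n_reps, seed=0):
    """Fraction of simulated data sets on which parsimony picks the true split."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        sites = [simulate_site(p_long, p_short, rng) for _ in range(n_sites)]
        if parsimony_choice(sites) == "AB|CD":
            hits += 1
    return hits / n_reps

print(recovery_rate(p_long=0.05, p_short=0.05, n_sites=500, n_reps=200))
print(recovery_rate(p_long=0.40, p_short=0.05, n_sites=500, n_reps=200))
```

Under these assumed settings one would expect the first rate to be close to 1 and the second close to 0, since two long branches on non-sister taxa generate more sites supporting the wrong split than the right one. Whether anything like this carries over to realistic models is the open question.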
It is very telling that so few professional statisticians have
ventured into the phylogenetics controversy (compare this with the field of
theoretical population genetics which has attracted some of the most
brilliant probabilists of this century: Bartlett, Feller, Karlin, Kolmogorov
and Moran to name only a few). I would guess that the reason is that the
mathematical issues are still very poorly defined in the field of
phylogenetics. If we biologists were able to clarify our thinking; if we
were able to decide what exactly we are trying to achieve with our
phylogenetic methods, and to consolidate our views on what constitute valid
evolutionary models, then we might have some hope of interesting the
mathematical types and some real progress in the theory of phylogeny
estimation might be made. Obviously, as Siddall suggests, the place to begin
is with sequence data, since single-locus genetic models are generally much
more tractable than quantitative genetic models, and require fewer assumptions.
I think that there is light at the end of the tunnel, but much of
the current methodology in phylogenetics is bound to become obsolete. My
advice to an ambitious young cladist would be: don't hitch your wagon too
tightly to any particular train; change is, after all, the indication of a
healthy scientific field. Buddhism and Christianity have both far outlived
Newtonian physics.
Please direct any replies to this news-group, rather than my email address.
Bruce Rannala, Department of Biology, Yale University
rannala at minerva.cis.yale.edu