"Logic of Cladistics"

Bruce Rannala rannala at MINERVA.CIS.YALE.EDU
Fri Jun 10 12:48:15 EST 1994


In the interests of polite conversation, I would generally avoid this topic, 
along with religion and politics. However, I think Felsenstein is correct in 
his assertion that the question of the logic underlying so-called "cladistic 
approaches" to phylogeny reconstruction has been largely ignored by the 
status quo mostly because of the unsettling outcomes (or lack of outcomes). 
I offer a few of my own observations on these questions, and my apologies 
for any nihilistic tendencies. The "classical" method of maximum-likelihood, 
due to R.A. Fisher, is generally applicable in cases where the family of 
possible distributions is known, apart from a finite number of unknown real 
parameters. Often, there may exist a unique set of parameter values that is 
most likely (i.e., maximizes the log-likelihood function), although this is 
by no means always the case. So the key feature here is that the family of 
"possible" distributions on the sample space is known (there are, of course, 
all sorts of other possible complications involving measure-theoretic 
considerations for discrete and continuous random variables which we 
biologists generally ignore).
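
To make the "known family, unknown parameter" idea concrete, here is a minimal sketch. It assumes a one-parameter model that is not discussed above (the Jukes-Cantor substitution model for two aligned sequences), and the site counts are invented for illustration. The single unknown is the distance d, in expected substitutions per site, and the numerical maximum of the log-likelihood is compared with the known closed-form estimate.

```python
import math

def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of distance d under the (assumed) Jukes-Cantor model.

    The binomial coefficient is omitted because it does not depend on d.
    """
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))  # chance that a site differs
    return n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)

def maximize(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximum of a unimodal function."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
    return (a + b) / 2.0

# Hypothetical data: 120 differing sites out of 1000.
n_sites, n_diff = 1000, 120
d_hat = maximize(lambda d: jc_log_likelihood(d, n_sites, n_diff), 1e-6, 5.0)

# Closed-form maximum-likelihood estimate for the same model.
d_closed = -0.75 * math.log(1.0 - 4.0 * n_diff / (3.0 * n_sites))
print(d_hat, d_closed)
```

The same toy model also shows why a unique maximum is "by no means always the case": if the observed fraction of differing sites reaches 3/4, the likelihood keeps increasing as d grows and no finite maximum exists.
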
A practical outcome of this constraint is that some "model" of evolution is 
generally needed to postulate the form of the distribution of character 
states over the sample space and estimate branch lengths or other parameters 
of interest. An example of a possible model of gene frequency change is the 
stochastic process known as "Brownian motion."
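
As a rough sketch of what such a model buys you, the following simulates two lineages diverging from a common ancestor under Brownian motion. The branch lengths, the rate sigma2 and the function name are all hypothetical, and the trait is treated as an unbounded real number, which ignores the fact that a raw gene frequency is confined to [0, 1].

```python
import math
import random
import statistics

def tip_difference(t1, t2, sigma2, rng):
    """Difference in trait value between two tips that split from one ancestor.

    Under Brownian motion the change along a branch of length t is
    Normal(0, sigma2 * t), and the two branches evolve independently.
    The ancestral value cancels out of the difference, so it is set to 0.
    """
    x1 = rng.gauss(0.0, math.sqrt(sigma2 * t1))
    x2 = rng.gauss(0.0, math.sqrt(sigma2 * t2))
    return x1 - x2

rng = random.Random(1994)          # fixed seed so the run is repeatable
t1, t2, sigma2 = 0.4, 0.6, 2.0     # illustrative values only
diffs = [tip_difference(t1, t2, sigma2, rng) for _ in range(20000)]

# The variance of the difference should be near sigma2 * (t1 + t2), so the
# spread between tips carries information about the total path length.
observed_var = statistics.pvariance(diffs, mu=0.0)
expected_var = sigma2 * (t1 + t2)
print(observed_var, expected_var)
```

Because the model fixes the form of the distribution, the observed spread between tips can be turned into a statement about branch lengths; without the model there is no such link.
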
        Many cladists of the Elliott Sober school get rather upset over 
simple models of evolution, such as "Brownian motion." In many cases, this 
is probably justified. However, the alternative they espouse is to adopt a 
"statistic," parsimony-based minimum branch lengths, to decide among trees 
without understanding any of the properties of this statistic. At this 
point, Occam's Razor is hurried in to save face. So the important 
question is: how well-behaved is the cladist's statistic really? Under what 
sorts of evolutionary models does parsimony work well, or not so well? A 
number of authors, including Felsenstein, Hillis, Nei and Penny (to name only 
a few), have tried to answer this question by evaluating the efficiency of 
parsimony methods under several different models of the evolutionary 
process, and also empirically using a known phylogeny (I believe it was for 
bacteria). The bottom line? Realistic evolutionary models tend to be 
multivariate stochastic processes that defy analytical solution, and 
closed-form expressions for the character state distributions (i.e., distribution 
functions or probability density functions) are generally unavailable. 
        It seems to me that there are three avenues from here: (1) 
increasingly complex computer simulations, inspired by research on 
evolutionary mechanisms, that attempt to evaluate the statistical properties 
of various phylogenetic estimators; (2) tracking real (perhaps 
artificially-accelerated) evolution in those organisms for which this is 
possible (mainly bacteria and viruses) and evaluating the statistical 
properties of the methods empirically (this would be tedious and allow for 
few generalizations to other species; each species might require a different 
estimator due to its "different" evolution); (3) developing phylogeny 
estimation methods with statistical properties that do not depend on any 
particular family of distributions over the sample space. For example, 
least-squares estimators require no knowledge of the form of the 
distribution of the error vector, apart from the mean and variance matrix. 
Recent methods of "partial-likelihood" might also be helpful here (please 
don't ask me to explain PL methods, as I am no expert; see your local stats 
professor).
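
The least-squares idea in (3) can be sketched as follows. This is only an illustration under stated assumptions: the four-taxon unrooted tree ((A,B),(C,D)), the distance values and the function names are all hypothetical, the topology is taken as given, and the fit is ordinary (unweighted) least squares, so the variance matrix mentioned above is not used.

```python
# Rows: pairwise distances AB, AC, AD, BC, BD, CD.
# Columns: branches a, b, c, d (terminal) and e (internal) of the
# unrooted tree ((A,B),(C,D)); an entry is 1 if the branch lies on the path.
PATHS = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def least_squares_branch_lengths(distances):
    """Ordinary least-squares branch lengths from six pairwise distances."""
    p = len(PATHS[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(row[i] * row[j] for row in PATHS) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * d for row, d in zip(PATHS, distances))
           for i in range(p)]
    return solve_linear(xtx, xty)

# Made-up distances that are close to, but not exactly, additive on the tree.
observed = [0.31, 0.54, 0.46, 0.64, 0.56, 0.20]
estimates = least_squares_branch_lengths(observed)
print(dict(zip("abcde", estimates)))
```

Nothing here says how the distances are distributed; the fit only minimizes the squared mismatch between observed and path-length distances. Note that unconstrained least squares can return a negative branch length, which this sketch does not guard against.
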
        It is very telling that so few professional statisticians have 
ventured into the phylogenetics controversy (compare this with the field of 
theoretical population genetics which has attracted some of the most 
brilliant probabilists of this century: Bartlett, Feller, Karlin, Kolmogorov 
and Moran to name only a few). I would guess that the reason is that the 
mathematical issues are still very poorly defined in the field of 
phylogenetics. If we biologists were able to clarify our thinking; if we 
were able to decide what exactly we are trying to achieve with our 
phylogenetic methods, and to consolidate our views on what constitute valid 
evolutionary models, then we might have some hope of interesting the 
mathematical types and some real progress in the theory of phylogeny 
estimation might be made. Obviously, as Siddall suggests, the place to begin 
is with sequence data, since single-locus genetic models are generally much 
more tractable than quantitative genetic models, and require fewer assumptions.
        I think that there is light at the end of the tunnel, but much of 
the current methodology in phylogenetics is bound to become obsolete. My 
advice to an ambitious young cladist would be: don't hitch your wagon too 
tightly to any particular train; change is, after all, the indication of a 
healthy scientific field. Buddhism and Christianity have both far outlived 
Newtonian physics. Please direct any replies to this news-group, rather than 
my email address.
  
 
Bruce Rannala, Department of Biology, Yale University
rannala at minerva.cis.yale.edu



