# African Eve

David Swofford swofford at uxh.cso.uiuc.edu
Wed Feb 26 15:18:05 EST 1992

```robison1 at husc9.harvard.edu (Keith Robison) writes:

>knox at spruce.gsfc.nasa.gov (Robert Knox) writes:
>>If you're going to use the PAUP
>>program on a moderately large data set, use a fast
>>computer and let it run for a long time!

>Whoa! Be careful here.  The real message is that if you use PAUP,
>make MANY different runs.  According to the Science articles, the
>problem arose because the 100 trees reported in the Vigilant et al
>paper came from a single run of PAUP.  Unfortunately, all trees
>from a single run of PAUP are inherently related.  No matter how
>long you let a SINGLE run last, this will be true.  The only way
>around it is to make multiple runs.
>(I'm basing all this on the articles -- I've never used PAUP).

As the author of PAUP, I think I'd better clarify things.  The basic
problem is that with large, noisy data sets, there can be several
(or many) optimality peaks.  The heuristic search algorithms used
by PAUP and other parsimony programs use a hill-climbing approach.
That is, given a starting tree somewhere on the landscape, they
strive for more parsimonious solutions by rearranging the starting
tree in a prespecified way.  If a rearrangement is succesful (i.e.,
took us further up the slope we were on), then we try to rearrange that
tree, and so on.  Eventually, we reach a point (the peak) at which
no rearrangement improves the fit of the tree(s), and we stop.
Hopefully, that peak represents a global optimum, but it may be only
a local optimum.

Now suppose that there is another set of equally parsimonious trees
such that no tree in this second set can be obtained by a sequence of
single-step rearrangements of any tree in the first set.  If we conduct
the rearrangement process (loosely called "branch swapping") from only
one starting tree, we can at best find only those trees in one set
or the other.  This is the essence of Vigilant et al.'s mistake: the
trees from which they drew their conclusions were all a single
rearrangement away from one starting tree, and therefore constituted
a very nonrandom sample of the trees that are equally parsimonious.

PAUP deals with this problem by providing a "random addition sequence"
option.  When this option is used, starting trees are constructed by
adding taxa sequentially to a developing tree in a (pseudo)random
order.  The effect of this procedure is that a wide variety of
starting trees can be input to the branch rearrangement cycle.

David Maddison, Maryellen Ruvolo, and I reanalyzed the Vigilant et
al. data using PAUP's random-addition sequence.  (Alan Templeton and
Hedges et al. papers in the above-cited Science letters did the
same, but their searches were not nearly as thorough as ours.)
I will omit the details (our paper will appear in the next issue of
Systematic Biology, formerly Systematic Zoology).  Basically, there
are many thousands of trees as parsimonious or nearly as parsimonious
as the one published by Vigilant et al. that do not support an African
origin.  We emphasize in our paper that we are not challenging the
notion of African origin of human mtDNA, merely that this conclusion
is not supported by the data of Vigilant et al.  We also challenged
the validity of their statistical test based on the hypergeometric
distribution, but I won't go into that now.

Now, permit me to adopt a slightly defensive posture with respect to
PAUP.  Keith Robison's comment should have said "If you use PAUP *OR
ANY OTHER PROGRAM FOR ESTIMATING PHYLOGENETIC TREES* make many
different runs."  Any program that uses a hill-climbing search
strategy is subject to the multiple optima problem.  The difference
is that PAUP has a way to directly address the problem; no other
parsimony program (to my knowledge) does.  PHYLIP (by Joe Felsenstein)
allows the option of a random addition sequence (input order), but there
is no facility for combining trees from multiple runs (i.e., if one
run generates 1000 trees and another generates 500 trees, there is no
(easy) way to determine whether any of the trees from the 500-set match
any of the trees from the 1000-set.  PAUP, on the other hand, does
an arbitrary number of random-addition-sequence replicates but only
adds new trees to the set if they do not match trees found in earlier
replicates.  This approach also speeds things up considerably, as
once a tree is found in replicate N that matches a tree found in
an earlier replicate, we can abandon that replicate immediately, as
we know that all other trees on the same peak will have already been
found.  The only other widely used parsimony program, Hennig86 by
J.S. Farris, does not even provide an input-order randomization option.
The only way to find many multiple peaks with Hennig86 is to manually
reorder the lines of the data matrix yourself before handing the data
file to the program, and the above-mentioned problems of PHYLIP still
pertain.

by David R. Maddison (1991): The discovery and importance of multiple
islands of most-parsimonious trees.  Syst. Zool. 40:315-328.  In
addition to providing a clear description of the problem, he documents
several cases (WARNING: shameless, self-indulgent plug coming) in which
PAUP finds many more equally parsimonious trees than does Hennig86.
--
David L. Swofford                 Phone:    (217)244-6959
Illinois Natural History Survey   FAX:      (217)333-4949
607 E. Peabody Drive              BITNET:   DAVESWOF at UIUCVMD
Champaign, Illinois 61820 USA     Internet: swofford at uxh.cso.uiuc.edu

```