Africa: summary of racist special pleading (long)

arlin at arlin at
Sun Nov 24 22:03:40 EST 1991

I have made a list of problems and criticisms relating to the mtDNA
analysis as seen in Cann, et al. 1987 and Vigilant, et al. 1991, and
as discussed here in recent weeks.  My present assessment is that the
data may slightly favor African origin of the extant human mtDNA gene
pool, but that i) the mtDNA tree and analysis have very little
statistical reliability; ii) the mtDNA "Eve" may not have been _h.
sapiens_; iii) there is nothing to reliably rule out the non-African
origin of parts of the remainder of the genome, and thus (from ii and
iii) no reason to rule out introgression of _h. sapiens_ with regional
populations of archaic _h. sapiens_ or _h. erectus_.

The comments fall into five general categories:

1. mistaken inferences about gene pool or species from mtDNA
2. mistaken notions about migration
3. problematic inferences about geographic locations of ancestors
4. possibly tendentious calibration of the mtDNA clock
5. low reliability of the mtDNA tree and its rooting

1.  The problem of inferring gene pool or (worse) species origin
from mtDNA origin.

I recently posted an analogy (see "Out of Africa: mtDNA does not tell
the whole story") showing why the gene pool and species origins cannot
be inferred directly from the mtDNA origin.  Much to their credit, the
Berkeley group (as evidenced in Vigilant, et al, 1991) is now
restricting their conclusions to mtDNA, and stressing the importance
of seeking data from nuclear markers.  I am not aware of the current
state of affairs with nuclear markers, though I recall that Africans
as a group are generally thought to carry greater nuclear genetic
diversity, an observation that accords well with the out-of-Africa
hypothesis (though genetic diversity is most certainly not as good an
indicator as a tree would be).

2.  Melodramatic model of migration.

There seems to be a tacit assumption that migrations are rare and
difficult odysseys.  However, the proximate barrier separating Africa
(in which the oldest _h. sapiens_ fossils, from South-Central Africa,
date to about 100,000 YA) from Eurasia (in which the oldest _h.
sapiens_ fossil, from Palestine, dates to 92,000 YA-- BTW the
radiometric date estimates are not statistically significantly
different) is the Sinai peninsula.  And, of course, different sites
within Africa and within Eurasia are separated by mountains, rivers,
and large distances, all of which migrants must cross.

Ecologists and biogeographers must be laughing at the melodramatic
mode of migration evinced by Cann, et al., who ACTUALLY USED THE WORD
"EXODUS", and Jones of bionet.molbio.evol fame, who used the word
"odyssey" in one of his politically correct diatribes. Their treatment
of migration evokes a resolute tribe of early _homo_ packing up their
bags, saying good-bye to their beloved homeland in Rhodesia, and
setting out on a 10,000 mile, decades-long journey to China.  If we
are to think of human progenitors as animals, we have no compelling
reason to propose that they went on long journeys into unknown regions
for the explicit purpose of getting from point A to point B, or for
the purpose of putting *great* distances between themselves and the
parent population.  Migrating groups in the present context represent
advancing waves of peripheral sub-populations, exploring un-exploited
territory, either by random diffusion or by directional pressure
exerted by territoriality or scarce food resources.  Rather than
coming in discrete packages, migration is a subtle process involving
physical displacement and consequent gene flow over many generations
(and we have many *thousands* of generations to work with in
explaining human biogeography).

3.  The problem of inferring mtDNA origin from geographic location of
extant lineages.

The method used by the Berkeley group seems to be based on the
assumption that migrating populations represent bottlenecks so extreme
that the populations can be treated as lineages (an assumption that is
related to the melodramatic migrations problem).  On the
contrary, migrations do not on theoretical grounds have to involve
these diversity-restricting bottlenecks and there is some experimental
evidence that they do not.  [The Native American mtDNA analysis of
Ward, et al.,  (1991,  PNAS 88: 8720) purports to show that their
ancestors had mtDNA diversity far pre-dating (by 30,000 years or more)
their migration into North America 20,000-30,000 YA.  This paper
warrants further investigation, however, since it may involve a
clock-calibration problem].  If migrating groups carry substantial
diversity, then there is no one-to-one correspondence between
geographic character state changes (e.g., *Africa* to *not Africa*) in
SINGLE LINEAGES and historical migration events (one-to-one
correspondence is the assumption made by the parsimony analysis).

The most recent attempt to shore up the 1987 Cann, et al. paper is
"Ancestral geographic states and the peril of parsimony" (Wilson,
Stoneking, and Cann, 1991, Syst Zool 40: 363).   This paper in fact
has some nice diagrams illustrating how sampling effects can lead to
false inferences, which the authors use to suggest that geographic
parsimony is an invalid procedure.  They fail to realize that exactly
the same sorts of sampling effects can lead to false inferences in
phylogenetic analyses of all types of data (e.g., DNA sequences), not
just geographic states.  The Berkeley group seems to have reached a
vague realization that the parsimony method is misleading, yet they
don't seem to understand why: as a solution they offer the
"hypergeometric test," which is unrelated to this problem, as far as I
can tell, and (more importantly) is inherently flawed (see below, #5).

4.  Calibration of mtDNA clock.

This problem has not been discussed in this group recently, and I
confess to being inadequately informed (so I'll just outline the
problem here).  The Berkeley group's choice of 4 MYA for the
chimp/human divergence put their African "Eve" at 200,000 YA, but
accepting the paleontological estimate of 9 MYA would put her at
upwards of 400,000 YA.   A different clock calibration thus would
change the overall outlook of the problem quite extensively: an
African ancestor of 200,000 YA is more likely to have been _h.
sapiens_ than _h. erectus_, while an African ancestor of 400,000 YA is
likely to have been _h. erectus_, or at least a very archaic _h.
sapiens_.  With a different clock, the mtDNA tree may represent the
spread of _h. erectus_ from Africa, rather than the spread of  _h.
sapiens_.  It would seem unfair that the paleontological chimp/human
date of 9 MYA, which would make "Eve" less likely to be _h. sapiens_,
is held suspect by the Berkeley group, since the logic of their
arguments depends very strongly on *other* paleontological dates that
they do not similarly question.
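
The arithmetic behind the two dates can be sketched as follows.  This
is only an illustration: it assumes a strictly linear clock, so that
the estimated date of the mtDNA ancestor scales in direct proportion
to the chimp/human calibration date.  The function name is
hypothetical; the figures are the ones quoted above.

```python
def rescale_coalescence_date(date_ya, old_calibration_mya, new_calibration_mya):
    """Rescale a clock-based date when the calibration point is changed.

    Assumes a strictly linear molecular clock (an assumption of this
    sketch), so the date scales by the ratio of the two calibrations.
    """
    return date_ya * (new_calibration_mya / old_calibration_mya)


# "Eve" at 200,000 YA under a 4 MYA chimp/human split, recomputed for 9 MYA
print(rescale_coalescence_date(200_000, 4.0, 9.0))
```

Under that assumption the 9 MYA calibration puts the ancestor at
450,000 YA, which is the "upwards of 400,000 YA" figure given above.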

5.  Phylogenetic analysis.  

There are several problems relating to flawed methodology and
seriously flawed statistics.  It should be noted that the mtDNA raw
data set has been improved somewhat: Cann, et al. had only restriction
map comparisons, whereas the more recent data of Vigilant, et al. are
DNA sequences.   Neither data set speaks for itself, however, no
matter how clean: the data must be analyzed using phylogenetic
inference methods, and flawed methods can only give questionable
results.

A. The mtDNA tree was rooted _ex post facto_ by Vigilant, et al.
Standard procedure is to *include* the outgroup in the analysis, since
including it can change the topology of the remaining parts of the
tree.  The same criticism applies to Cann, et al's and Vigilant, et
al's attempt to find the shortest length "non-African" tree by
swapping a branch on the _ex post facto_ rooted parsimony tree.  I see
no reason (and no reason was offered by the authors) to suppose that
this procedure will find the shortest "non-African" tree.

B.  The actual tree length/topology combination shown by Cann, et al.
could not be replicated by Maddison, 1991 (Syst. Zool 40: 355), who
claims to have found shorter trees from the same data.  If one also
takes into account Maddison's claims that Cann et al. failed to
use the parsimony method of choice for restriction sites (i.e., Dollo
parsimony), and that analysis of ancestral geographic states for the
most parsimonious trees from the Cann, et al data support *both*
African and non-African origins, one begins to get the impression that
Cann et al. did not know what they were doing when they ran PAUP on
their data in 1987.   Hopefully the situation has improved since
then.  However, I for one am not going to be placing any bets on the
Vigilant, et al. 1991 tree until I see the work replicated somewhere
else.  Please note in fairness that these comments should not be taken
as an explicit criticism of the Afro-genetic hypothesis or the
research group involved-- it's really a criticism of the failure of the
molecular evolution community to educate and monitor with regard to
phylogenetic inference methods.

C. Regardless of how the trees come out when the data are re-analyzed,
the statistical methods used by Cann, et al. and by Vigilant, et al.
cannot justifiably be used to assess their significance.

i) The "winning sites" test employed by Cann, et al. and later by
Vigilant, et al was misapplied.  AS WITH ANY STATISTICAL TEST, THE
HYPOTHESIS TO BE TESTED MUST BE SPECIFIED IN ADVANCE.  You cannot roll
the dice, see that it is 3 and 5, and then say "gosh! isn't it
surprising that I got a 3 and a 5!  The two largest prime numbers in
the set (1,2,3,4,5,6)!"   But this is just what the authors did.
*After* parsimony analysis, they took one of the trees in the
*shortest* class and showed it to be "significantly" better than
another tree ("non-African") derived by making a major branch swap.
Even if the data were random, we would still expect that the trees
favored by parsimony would be shorter than other trees!  Note,
however, that the "winning sites" test is not inherently flawed: it
could have been properly applied if they had specified the precise
alternative topologies to be tested BEFORE doing parsimony-- an
opportunity no longer available to them.
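
The effect of choosing the comparison after the fact can be
illustrated with a small simulation.  This is a toy sketch, not a
re-analysis of the mtDNA data: the "trees" are just lists of random
per-site step counts (1 or 2), so no tree is truly better than any
other, and the "winning sites" test is taken to be a two-sided sign
test on the sites where two trees differ.  All names and parameter
values are hypothetical.

```python
import random
from math import comb


def sign_test_p(wins, losses):
    """Two-sided sign test p-value for wins vs. losses (ties dropped)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def winning_sites_p(steps_a, steps_b):
    """Compare two trees site by site: a 'win' is a site needing fewer steps."""
    wins = sum(a < b for a, b in zip(steps_a, steps_b))
    losses = sum(a > b for a, b in zip(steps_a, steps_b))
    return sign_test_p(wins, losses)


def false_positive_rates(n_reps=500, n_trees=30, n_sites=60, alpha=0.05, seed=1):
    """Return (pre-specified rate, post hoc rate) on purely random data."""
    rng = random.Random(seed)
    pre = post = 0
    for _ in range(n_reps):
        trees = [[rng.randint(1, 2) for _ in range(n_sites)]
                 for _ in range(n_trees)]
        # Pre-specified comparison: trees 0 and 1, chosen before looking.
        if winning_sites_p(trees[0], trees[1]) < alpha:
            pre += 1
        # Post hoc comparison: the shortest tree against some other tree.
        best = min(trees, key=sum)
        rival = next(t for t in trees if t is not best)
        if winning_sites_p(best, rival) < alpha:
            post += 1
    return pre / n_reps, post / n_reps
```

Calling false_positive_rates() returns the two rejection rates.  The
pre-specified comparison should reject at roughly the nominal 5% or
less, while the post hoc comparison rejects far more often, even
though the data contain no signal at all.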

ii) In contrast to the "winning sites" test, the "hypergeometric" test
can never be applied to the topology of a parsimony tree: it is
completely bogus.  The theoretical reason for this is not easy to
explain briefly, but it can be demonstrated intuitively (as was done
in a previous posting).  If you take an unrooted tree with the same
topology as that shown on p. 1505 of Vigilant, et al., and randomly
place roots on it (throw darts at it, then assign a root to the link
nearest the dart), you will find that the distribution of African and
non-African clades adjacent to the root does not have the
hypergeometric distribution.  Instead, lineages in all parts of the
tree tend to be clustered, with a greater than random tendency for
African clades to stick together, and also for non-African clades to
stick together. That birds of a feather stick together is a universal
phenomenon in phylogeny, but it is ignored by the hypergeometric
test!!  The test says nothing about the reliability of the root
placement within the tree (there are other places on the tree where a
root would have been completely surrounded by non-Africans)-- it only
tells us that clades are clustered with regard to geographic state, so
that even random roots will tend to fall within clusters.

On top of this, Vigilant, et al. have once again failed to identify
hypotheses to be tested PRIOR to carrying out the analysis.  They
might have argued that, given 31 African clades and 24 non-African
clades, the 5% significance level would be represented by getting more
than 5 African clades adjacent to the root, while the 1% level would
be represented by getting more than 8. Instead, AFTER observing 14
African lineages next to the root, they calculate the probability of
getting 14 or more.  Correcting this problem would make them better
statisticians, but the hypergeometric test would still be invalid, as
explained above.
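
For concreteness, the cut-offs that could have been fixed in advance
can be computed with a short sketch.  It assumes the simplest reading
of the test: the k clades nearest the root are treated as a random
draw from 31 African and 24 non-African clades, and one asks for the
probability that all k are African.  That reading and the function
names are assumptions made here for illustration, not a transcription
of the calculation in Vigilant, et al.  Two versions are given: an
exact draw without replacement, and the approximation that treats the
draws as independent.

```python
from math import comb

N_AFRICAN, N_OTHER = 31, 24  # clade counts quoted in the text


def p_exact(k):
    """P(k clades drawn without replacement are all African)."""
    return comb(N_AFRICAN, k) / comb(N_AFRICAN + N_OTHER, k)


def p_approx(k):
    """Same probability if the k draws are treated as independent."""
    return (N_AFRICAN / (N_AFRICAN + N_OTHER)) ** k


def critical_k(alpha, prob):
    """Smallest k whose all-African probability falls below alpha."""
    k = 1
    while prob(k) >= alpha:
        k += 1
    return k


for alpha in (0.05, 0.01):
    print(alpha, critical_k(alpha, p_approx), critical_k(alpha, p_exact))
```

With independent draws the cut-offs come out at 6 and 9 clades, i.e.
"more than 5" and "more than 8" as above; the exact draw without
replacement gives slightly lower cut-offs.  Either way, the point is
that the cut-off is fixed before the tree is examined.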

In general, I can sympathize with their hapless position, since they
have too many short branch lengths (many of the internal branches
are fractions of a substitution in length) and too many OTUs to do
bootstraps on pre-specified clades.  However, the shortness of the
branch lengths indicates the nature of the problem: any single tree
topology simply isn't at all significant.  The authors have tried
intuitively to argue that certain *classes* of trees are significant
outcomes of the analysis, but the mathematics for doing statistical
analysis on classes of related trees (of the type that Cann, et al.
and Vigilant, et al. want to do) is not presently available. It has
only confused the situation to make up _ad hoc_ statistical tests.  We
can only hope that more appropriate tests will be devised in the
future.
Arlin Stoltzfus
