testing if seqs. are in same phylo. tree?

Huelsenbeck ednah at mws4.biol.berkeley.edu
Thu Nov 9 21:13:50 EST 1995

In article <beetle-0911950936310001 at bembidion.agforbes.arizona.edu>,
beetle at ag.arizona.edu (David Maddison) wrote:

> I am involved in an analysis of some sequences, and it
> is unclear if all of them really do share a history.  
> That is, some of them may not actually be part of the
> same phylogenetic tree, and may represent independent
> derivations of the same function.
> I'd like to know what literature is out there that
> deals with methods of testing to see if sequences 
> really share a phylogenetic history.  What papers
> are people aware of on this issue?
> Thanks,
>    David
> -- 
> David R. Maddison
> Department of Entomology
> University of Arizona
> Tucson,  AZ  85721
> beetle at ag.arizona.edu

Hi David,

It sounds as if you want to test whether estimates from different data
partitions (e.g., genes) are significantly different (more different than
would be expected by stochastic variation).  There are a couple of tests
available that might prove acceptable.

I assume that you have aligned sequences:

Species_1   partition_1  partition_2 ... partition_n
Species_2   partition_1  partition_2 ... partition_n
Species_3   partition_1  partition_2 ... partition_n
Species_s   partition_1  partition_2 ... partition_n

Jim Bull and I (Huelsenbeck and Bull, Systematic Biology, in press) 
propose a likelihood-ratio test (the likelihood heterogeneity test) 
to evaluate the hypothesis that differences in phylogenetic estimates 
can be explained by stochastic variation.  In our application, we 
specifically test for heterogeneity in topology (branching order) 
but the test is trivially modified to evaluate other aspects of the 
phylogenetic model.  The likelihood heterogeneity test compares the 
likelihood (L0) obtained under the constraint that the same phylogeny 
underlies all of the data sets to the likelihood (L1) obtained when 
this constraint is relaxed.  Under the null hypothesis, H0, the same 
tree is assumed to underlie the data from different genes, although 
the rates of evolution as well as other parameters are allowed to vary 
between the genes.  Not only are the overall rates (for the genes as 
wholes) allowed to vary, but the relative rates (from branch to branch 
of the trees) can also differ among genes.  Under the alternative 
hypothesis, H1, different trees as well as evolutionary rates can 
underlie each gene.  The likelihood ratio test statistic is 

d = 2(ln L1 - ln L0).

Because the null hypothesis is a subset of the alternative 
hypothesis, this statistic should asymptotically follow a chi-square 
distribution with (n - m) degrees of 
freedom, where n is the number of parameters under H1 and m is the 
number of parameters under H0 (Rice, 1995).  However, Goldman (1993) 
has shown that for the phylogeny problem, the chi-square distribution 
is not appropriate, and instead suggested Markov simulation of the null 
distribution to determine the critical values for d.  In the absence 
of suitable asymptotic results appropriate for all parameter values 
under the null hypothesis, the maximum likelihood values are instead 
used in the simulations.  The simulations thus assume the same tree 
for all genes but different branch lengths (and other parameter values) 
among data partitions.
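
To make the bookkeeping concrete, here is a minimal Python sketch of the
test statistic and of a p-value read off a simulated null distribution.
It is only an illustration under stated assumptions, not the code used for
the paper: the table of log-likelihoods, the function names, and the
(count + 1)/(N + 1) p-value convention are all hypothetical choices.  It
assumes that, for every partition and every candidate tree, you already
have the log-likelihood maximized over branch lengths and other parameters
for that partition alone.

```python
def heterogeneity_statistic(lnL):
    """Return (lnL0, lnL1, d) from lnL[partition][tree].

    Assumption: lnL[partition][tree] is the log-likelihood of that
    partition on that tree, maximized separately for the partition.
    """
    partitions = list(lnL)
    trees = list(lnL[partitions[0]])
    # H0: one tree shared by all partitions, parameters free per partition.
    lnL0 = max(sum(lnL[p][t] for p in partitions) for t in trees)
    # H1: each partition may have its own best tree.
    lnL1 = sum(max(lnL[p][t] for t in trees) for p in partitions)
    return lnL0, lnL1, 2.0 * (lnL1 - lnL0)


def simulated_p_value(d_observed, d_simulated):
    """Share of null replicates with d at least as large as observed.

    The +1 terms count the observed value as one of the replicates.
    """
    hits = sum(1 for d in d_simulated if d >= d_observed)
    return (hits + 1) / (len(d_simulated) + 1)


# Made-up numbers, for illustration only.
example = {
    "gene1": {"T1": -100.0, "T2": -105.0},
    "gene2": {"T1": -210.0, "T2": -204.0},
}
lnL0, lnL1, d = heterogeneity_statistic(example)
p = simulated_p_value(d, [1.0, 2.5, 12.0, 3.0])
```

Each value in d_simulated would come from repeating the whole analysis on
a data set simulated under the maximum likelihood estimates for H0.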

I've done some very limited simulations, and it seems that the parametric
bootstrap approach does a good job of generating the null distribution.
Jim and I have also applied the method to the problem of amniote relationships.

Farris et al. (Cladistics, 1995) also proposed a test that addresses the
same problem using, of course, parsimony as the optimality criterion.  They
use as the test statistic the Mickevich-Farris index:

MF = Lcombined - Sum_over_all_partitions(Li)

where L is the length of the tree for either the combined data or for the
i-th data partition.  They propose that the null distribution for this
test statistic be determined by randomly reassigning the characters, without
replacement, to new data partitions of the same sizes as the originals.  Swofford also proposed this
resampling scheme to me several years earlier and has implemented the 
method in PAUP* 4.0 (as the combinability test).  You might want to talk
with him.  The advantage of this test is that it can be applied to both
molecular and morphological data.  The disadvantage appears to be power.
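
The resampling scheme can be sketched in a few lines of Python.  This is a
toy illustration, not the PAUP* implementation: it assumes only four taxa,
so the shortest tree can be found by checking all three unrooted
topologies with Fitch parsimony, whereas a real analysis needs a proper
tree search.  The function names and the encoding of each character as a
four-letter string (one state per taxon) are hypothetical.

```python
import random

# The three unrooted topologies for taxa 0..3, written as two cherries.
FOUR_TAXON_TREES = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]


def fitch_steps(column, tree):
    """Fitch parsimony steps for one character on one four-taxon tree."""
    steps, node_sets = 0, []
    for x, y in tree:
        shared = {column[x]} & {column[y]}
        if shared:
            node_sets.append(shared)
        else:
            node_sets.append({column[x], column[y]})
            steps += 1
    if not (node_sets[0] & node_sets[1]):
        steps += 1  # change along the internal branch
    return steps


def tree_length(columns):
    """Length of the shortest of the three trees for these characters."""
    return min(sum(fitch_steps(c, t) for c in columns)
               for t in FOUR_TAXON_TREES)


def mf_index(partitions):
    """MF = Lcombined - sum of the partition lengths Li."""
    combined = [c for part in partitions for c in part]
    return tree_length(combined) - sum(tree_length(p) for p in partitions)


def resampling_test(partitions, replicates=99, seed=0):
    """Observed MF and the share of random repartitions with MF >= it."""
    rng = random.Random(seed)
    observed = mf_index(partitions)
    pooled = [c for part in partitions for c in part]
    hits = 0
    for _ in range(replicates):
        rng.shuffle(pooled)  # reassign characters without replacement
        new_parts, start = [], 0
        for part in partitions:
            new_parts.append(pooled[start:start + len(part)])
            start += len(part)
        if mf_index(new_parts) >= observed:
            hits += 1
    return observed, (hits + 1) / (replicates + 1)


# Two made-up partitions that support different trees.
conflicting = [["AABB"] * 5, ["ABAB"] * 5]
mf_conflict, p_conflict = resampling_test(conflicting)
# Two made-up partitions that support the same tree.
mf_same, p_same = resampling_test([["AABB"] * 5, ["AABB"] * 5])
```

A small p-value says that the original partitions fit their own trees
better than randomly assembled partitions of the same sizes do.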

I hope this is all helpful.

John Huelsenbeck
Department of Integrative Biology
University of California
Berkeley, CA  94720

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *                      
                       John & Edna Huelsenbeck
johnh at mws4.biol.berkeley.edu             ednah at mws4.biol.berkeley.edu
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

More information about the Mol-evol mailing list