re ci, ri, tree length

Ted Schultz ts15 at cornell.edu
Tue May 16 16:59:07 EST 1995


In article <gbga96-150595165329 at b-cohen2.genetics.gla.ac.uk>,
gbga96 at udcf.gla.ac.uk (jonathan sheps) wrote:

> In article <3p716i$4qk at mserv1.dl.ac.uk>, "Essop, FM, Dr"
> <MFESSOP at chempath.uct.ac.za> wrote:
> 
> > I have noticed Mark Siddall's response to my "naive" questions 
> > regarding ci's , ri's etc.   For his information, I have been doing 
> > various analyses on my data set (with Hennig86) and found the results 
> > confusing.  This confusion led to my questions regarding the EXACT 
> > meaning of these indices.  When I performed such analyses at 
> > different error values, Hennig86 produced different trees.  The 
> > problem I've got is one of being totally objective in my analysis.   
> > Which tree is the correct tree ?   I can actually select a tree to my 
> > fancy - isn't this subjective ?   In the light of THESE observations, 
> > I raised my questions as to what EXACTLY these values mean.  Where is 
> > the cut-off value ?  How then should one "decide" what the best tree 
> > is ?  These questions have unfortunately not been answered.  
> 
> 
> I think Siddall gave you the exact meaning of these terms, but CI and RI
> describe the homoplasy levels in the tree, they can't really be used to
> compare trees generated from different data sets (though RI is closer to
> being able to do this). For any given data set the shortest tree should
> usually be the best one. How much longer than the shortest you still
> consider a "reasonable" tree is up to you. I'm not sure what you mean by
> "error values", but these must be altering your data set, and so giving
> different trees. If so then you will have a family of shortest trees for
> each error value, and how you choose the best error value I don't know.

Methods have been suggested for obtaining significance levels for trees as
well as for subtrees (i.e., for branches supporting groups in given trees)
under the parsimony criterion.  Others may disagree, but I believe that
some (if not all) of these methods are related in that they are based on
the criterion of character congruence.

In this way these significance methods are also related to ci and ri,
which are also measures of character congruence.  A high ci or ri value
implies that there is little conflict in the data, i.e., most of the
characters support the same tree topology.  A low ci or ri value implies
that the characters disagree about tree topology.  Under the assumption
that our confidence in a tree or subtree increases as the agreement among
the characters increases, the following methods (among others) have been
proposed:

1. PTP (Permutation Tail Probability) (J.W. Archie. 1989. Syst. Zool.
38:239-252.  D. Faith and P. Cranston.  1991.  Cladistics 7:1-28.)  In
this test, all the features of the real data matrix are retained (number
of characters, number of taxa, number and frequency of states within each
character), but one parameter is randomized: the assignment of states
within a character to taxa.  The length of the shortest tree is found for
each of many such permutations of the data, and it is determined whether
the tree length for the real data is significantly shorter than this
distribution of tree lengths obtained from permuted data.  If the answer
is "yes," then it is concluded that the congruence of the characters in
the data significantly exceeds what might be expected simply due to
randomness, and, presumably, our confidence in the most parsimonious tree
is increased.

2. T-PTP (Topology-dependent PTP) (D. Faith. 1991. Syst. Zool. 40:366-375.)
This uses permutations identical to those of the PTP test, but for the
purpose of determining how well-supported a subtree is.  In this case the
Bremer support for a branch is determined.  (Bremer support is the number
of steps separating the length of the most parsimonious tree and the
length of the shortest tree in which the branch supporting the subtree of
interest is collapsed.)  Then the question is asked: "Does this level of
Bremer support significantly differ from the level I might expect due
simply to random congruence?"  Again, permuted matrices are generated, and
for each one the difference in tree length is found between the shortest
tree containing that group and the shortest tree in which that group is
absent.  This yields a distribution of such differences and an answer to
the above question.

3. Bootstrap (J. Felsenstein. 1985. Evolution 39:783-791.)  I doubt I'm
telling you anything new by describing how this works: the character set
is resampled with replacement, and support for a group is determined by
how often that group appears in trees resulting from these resampled data
sets.  This test also reflects character congruence: groups supported by
characters that conflict with other characters will be poorly supported.
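
For anyone who wants to see the mechanics, here is a rough sketch of the
three procedures above on a toy scale.  It is not how Hennig86 or any other
package does it.  I am assuming unordered (Fitch) characters, no missing
data, and so few taxa that every tree can be enumerated; rooted topologies
are enumerated because Fitch length does not depend on the root, and a
"group" is treated as a split of the unrooted tree.  The function names and
the data matrix are made up for illustration, and counting a bootstrap
replicate only when every shortest tree contains the group is my own
simplification for handling ties.

```python
import random

def insert_leaf(tree, leaf):
    """Yield each tree made by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for t in insert_leaf(left, leaf):
            yield (t, right)
        for t in insert_leaf(right, leaf):
            yield (left, t)

def all_trees(n_taxa):
    """Every rooted binary topology on taxa 0..n_taxa-1, as nested tuples."""
    trees = [0]
    for leaf in range(1, n_taxa):
        trees = [t for tree in trees for t in insert_leaf(tree, leaf)]
    return trees

def fitch(tree, states):
    """Fitch pass for one unordered character: (state set, steps)."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    (ls, ln), (rs, rn) = fitch(tree[0], states), fitch(tree[1], states)
    if ls & rs:
        return ls & rs, ln + rn
    return ls | rs, ln + rn + 1

def length(tree, matrix):
    """Tree length: Fitch steps summed over all characters."""
    return sum(fitch(tree, char)[1] for char in matrix)

def shortest(matrix, trees):
    return min(length(t, matrix) for t in trees)

def leaf_sets(tree, found):
    """Add the leaf set of every subtree to `found`; return this subtree's set."""
    if isinstance(tree, tuple):
        leaves = leaf_sets(tree[0], found) | leaf_sets(tree[1], found)
    else:
        leaves = frozenset([tree])
    found.add(leaves)
    return leaves

def has_group(tree, group, n_taxa):
    """True if `group` or its complement is a subtree, i.e. a split of the tree."""
    found = set()
    leaf_sets(tree, found)
    group = frozenset(group)
    return group in found or (frozenset(range(n_taxa)) - group) in found

def permute(matrix, rng):
    """Shuffle states among taxa independently within each character."""
    return [rng.sample(char, len(char)) for char in matrix]

def ptp(matrix, trees, n_perm, rng):
    """Observed shortest length and the share of matrices at least as short."""
    observed = shortest(matrix, trees)
    hits = sum(shortest(permute(matrix, rng), trees) <= observed
               for _ in range(n_perm))
    # The real matrix is counted as one member of the distribution.
    return observed, (hits + 1) / (n_perm + 1)

def t_ptp(matrix, trees, group, n_perm, rng):
    """Length difference (shortest tree without the group minus shortest
    tree with it) and the share of matrices with a difference at least as large."""
    n = len(matrix[0])
    with_g = [t for t in trees if has_group(t, group, n)]
    without_g = [t for t in trees if not has_group(t, group, n)]
    def diff(m):
        return shortest(m, without_g) - shortest(m, with_g)
    observed = diff(matrix)
    hits = sum(diff(permute(matrix, rng)) >= observed for _ in range(n_perm))
    return observed, (hits + 1) / (n_perm + 1)

def bootstrap(matrix, trees, group, n_rep, rng):
    """Share of replicates in which every shortest tree contains `group`."""
    n = len(matrix[0])
    in_group = [has_group(t, group, n) for t in trees]
    hits = 0
    for _ in range(n_rep):
        sample = rng.choices(matrix, k=len(matrix))  # characters, with replacement
        lengths = [length(t, sample) for t in trees]
        best = min(lengths)
        hits += all(g for g, l in zip(in_group, lengths) if l == best)
    return hits / n_rep

# Made-up matrix: rows are characters, columns are taxa 0..4.
matrix = [[0, 0, 1, 1, 1]] * 3 + [[0, 0, 0, 1, 1]] * 3
trees = all_trees(5)
rng = random.Random(1995)
print("PTP (length, tail share):", ptp(matrix, trees, 99, rng))
print("T-PTP for (3,4):", t_ptp(matrix, trees, (3, 4), 99, rng))
print("Bootstrap for (3,4):", bootstrap(matrix, trees, (3, 4), 100, rng))
```

Exhaustive enumeration is only workable for a handful of taxa (105 rooted
topologies for five taxa); real data sets would need a heuristic search in
place of all_trees.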

I realize that none of these tests is what Dr. Essop is actually after,
i.e., a way of finding a cutoff defining a family of trees that are
not significantly less desirable than the most parsimonious tree. 
However, they imply other possible tests that could conceivably accomplish
the goal of finding such a family of trees.  For instance, all trees that
passed the PTP test could be considered candidates for the "true" tree. 
(In practice, I suspect that this would result in retaining so many trees
that the strict consensus would look like a bush, because the PTP test is
an extremely weak test.)  Likewise, distributions of ci's and ri's could
be obtained from the permutation approach, and the ci of the most
parsimonious tree could be compared to this distribution.
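
To make the indices concrete, here is a small sketch of how the ensemble ci
and ri could be computed for one tree.  Again I am assuming unordered
(Fitch) characters with no missing data; the tree and matrix are invented.
For each character, m is the fewest steps possible on any tree (number of
states minus one), g is the most steps possible on any tree (number of taxa
minus the count of the commonest state), and s is the number of steps on
the tree in hand.  Then ci = sum(m)/sum(s) and ri = (sum(g) - sum(s)) /
(sum(g) - sum(m)).

```python
from collections import Counter

def fitch_steps(tree, states):
    """Fitch pass for one unordered character: (state set, steps)."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    (ls, ln), (rs, rn) = fitch_steps(tree[0], states), fitch_steps(tree[1], states)
    if ls & rs:
        return ls & rs, ln + rn
    return ls | rs, ln + rn + 1

def ci_ri(tree, matrix):
    """Ensemble consistency and retention indices of `matrix` on `tree`."""
    m = g = s = 0
    for char in matrix:
        counts = Counter(char)
        m += len(counts) - 1                   # minimum steps on any tree
        g += len(char) - max(counts.values())  # maximum steps on any tree
        s += fitch_steps(tree, char)[1]        # steps on this tree
    if s == 0 or g == m:
        raise ValueError("indices undefined: no variable or informative characters")
    return m / s, (g - s) / (g - m)

# Made-up example: rows are characters, columns are taxa 0..4.
tree = ((0, 1), (2, (3, 4)))
matrix = [
    [0, 0, 1, 1, 1],  # fits the tree: 1 step
    [0, 0, 0, 1, 1],  # fits the tree: 1 step
    [0, 1, 0, 1, 0],  # conflicts with the tree: 2 steps
]
ci, ri = ci_ri(tree, matrix)
print(f"ci = {ci:.2f}, ri = {ri:.2f}")  # ci = 0.75, ri = 0.67
```

A permutation distribution of ci or ri would come from repeating this on
the shortest tree found for each permuted matrix.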

I am not necessarily defending any of the tests described above.  Indeed,
there's a lot of controversy about this subject.  Furthermore, other tests
exist, including ones designed for criteria other than parsimony (e.g.,
the log-likelihood test of H. Kishino and M. Hasegawa. 1989. J. Mol. Evol.
29:170).

Maybe other readers know of more appropriate tests.

-- 
Ted Schultz
ts15 at cornell.edu


