In article <patersoa.53.000BF4BA at lincoln.ac.nz>, patersoa at lincoln.ac.nz
(Adrian Mark Paterson) summarized several suggestions on how to
compute Decay Indices (or Support Indices) including my own. My post
to Adrian was intended to help him with practical aspects of computing
"SI" values, but I am happy to also respond to Joe Felsenstein's
comments on my comparison of SI and bootstrap values, because I should
have anticipated that my comments might have been posted to this
newsgroup as well as to their intended recipient. Since it did get
posted here, please tolerate the following rather lengthy
clarifications.
Two notes of clarification on the mechanics of SI calculation.
First, I think I misspoke when I mentioned to Adrian that I thought
PAUP 4 would calculate SI values; I'm not sure where I got that impression.
Second, the method that my software makes semi-automatic is the one
suggested by Andrew Mitchell in the original posted summary: converse
constraint searches employing PAUP. The method suggested by Alice Hempel
and Jim Manhart will work only for SI values up to about 3, depending
on the data set. As they noted, one will eventually run out of memory as
the number of trees explodes. The converse constraint approach should also
be more efficient with respect to computer time. I provide a somewhat
lengthy example below that you can follow if you don't want to
download my large software package just to do this.
My software takes an input Nexus file with minimally a tree block such as:
BEGIN TREES;
TRANSLATE
1 Thylacinus,
2 Sarcophilus,
3 Dasyurus,
4 Echymipera,
5 Trichosurus,
6 Phalanger,
7 Philander,
8 Bos
;
UTREE Fig._2_tree = ((((1,(3,2)),((5,6),4)),7),8);
END;
and will output the following:
begin assumptions;
[Nodes for tree Fig._2_tree (tree 1)]
taxset t1_n3 = 1 3 2 5 6 4;
taxset t1_n4 = 1 3 2;
taxset t1_n5 = 3 2;
taxset t1_n6 = 5 6 4;
taxset t1_n7 = 5 6;
end;
begin paup;
[Constraints for nodes in tree Fig._2_tree (tree 1)]
constraints ct1_n3=((t1_n3));
constraints ct1_n4=((t1_n4));
constraints ct1_n5=((t1_n5));
constraints ct1_n6=((t1_n6));
constraints ct1_n7=((t1_n7));
end;
begin paup;
log file = 'si_calcs.log' append;
[Constraint search blocks created by DNA Translator stack on
4/25/95, 3:12 PM]
SET AUTOCLOSE;
SET MAXTREES = 100;
SET INCREASE=AUTO;
SET NOERRORSTOP;
SET NOWARNRESET;
SET [NO]BACKGROUND;
SET [NO]STATUS;
SET [NO]CHECKEVTS;
[SET [NO]DISPLAY;] [You may want to change this to speed up processing.]
[Uncomment the above line for PAUP <= 3.1.1]
[SET [NO]MONITOR;] [You may want to change this to speed up processing.]
[Uncomment the above line for PAUP >= 4]
SET OUTROOT=PARAPHY; [You may prefer the default "Polytomy"]
[!Following search uses ct1_n3 constraint]
showconstr ct1_n3;
BANDB enforce constraints = ct1_n3 converse;
describe 1 /noplot;
[!Following search uses ct1_n4 constraint]
showconstr ct1_n4;
BANDB enforce constraints = ct1_n4 converse;
describe 1 /noplot;
[!Following search uses ct1_n5 constraint]
showconstr ct1_n5;
BANDB enforce constraints = ct1_n5 converse;
describe 1 /noplot;
[!Following search uses ct1_n6 constraint]
showconstr ct1_n6;
BANDB enforce constraints = ct1_n6 converse;
describe 1 /noplot;
[!Following search uses ct1_n7 constraint]
showconstr ct1_n7;
BANDB enforce constraints = ct1_n7 converse;
describe 1 /noplot;
log stop;
end;
as an example with the default options chosen except that a
branch-and-bound search was specified (with a greater number of
taxa, use the default "HSEARCH ADDSEQ=RANDOM NREPS=10" settings).
You will be given the option of appending it to your data file so that you
can perform all SI searches without interruption by simply executing
your data file. Then you need to scroll through the log output and subtract
the unconstrained length from the length of each constrained search
to get the SI value for each corresponding node. Note that nodes 1
and 2 are not appropriate for SI or bootstrap values. (How many
times have you seen someone present a tree with an ingroup supported
by 100% bootstrap value with a single outgroup?)
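The bookkeeping in that last step is simple but worth spelling out. Here is a
minimal Python sketch of my own (not part of the original software; all names
and tree lengths below are hypothetical) showing the subtraction that turns
converse-constraint search lengths into SI values:

```python
# Hypothetical illustration of decay (support) index arithmetic.
# SI for a node = (length of the shortest tree LACKING that node,
# found by the converse-constraint search) minus the length of the
# overall shortest (unconstrained) tree.

unconstrained_length = 1024  # hypothetical minimum tree length

# Hypothetical constrained lengths read from the PAUP log,
# keyed by the constraint name for each internal node.
constrained_lengths = {
    "ct1_n3": 1031,
    "ct1_n4": 1027,
    "ct1_n5": 1025,
    "ct1_n6": 1030,
    "ct1_n7": 1024,  # node collapses at no extra cost: SI = 0
}

si_values = {node: length - unconstrained_length
             for node, length in constrained_lengths.items()}

for node, si in sorted(si_values.items()):
    print(f"{node}: SI = {si}")
```

An SI of 0, as for the last node above, means an equally short tree exists
without that clade.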
Concerning my comparison of SI and bootstrap values, it was not my
intention to be overly critical of the bootstrap. A bootstrap value
might be a useful heuristic, even if not a precise confidence interval,
for node robustness. I happen to prefer SI values for reasons given.
It is true that neither high bootstrap nor high SI values will necessarily
mean that a node is robust because it reflects historical truth.
Other systematic biases in the data set (e.g., frequent T->C
transitions at sites free to vary) could be an alternative explanation.
>>SI values vs. Bootstrap values:
>>A high SI generally corresponds to a high bootstrap value (with some
>>infrequent exceptions)
>This seems true but little is yet known about statistical properties of
>SI. Perhaps Doug would argue that statistical inference is not the right
>framework for thinking about this anyway ...
[Text that might suck me into a sticky methodological and political debate
respectfully deleted for now.]
Yes it is true that many do not view historical inference as a problem
of statistical inference. The arguments against applying statistics to
historical reconstruction usually concern the difficulty of generalizing
about how historical events occur. You mentioned Farris, but he must not be
totally against bootstrap estimations because he provides very efficient
calculation of them in his program RNA. Of those I know who prefer
viewing historical reconstruction as a problem of statistical inference,
who might for example prefer maximum likelihood to parsimony, several are
extremely critical of the bootstrap as either a measure of reliability or
repeatability. You have already discussed why it gives high values when a
method is inconsistent.
>> but:
>>1. SI calculations are based on exactly the same data set as your
>>parsimony search (i.e., they are based on all available evidence).
>So are bootstrap values.
>
What I meant was that each replicate search has some characters represented
more than once and others not represented at all; each resampled matrix is
potentially unique. This is not true of SI searches, where all of the data
are used.
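To illustrate the resampling (a sketch of my own, not from either post): a
single bootstrap pseudoreplicate draws character columns with replacement, so
some columns appear several times and others are dropped entirely, whereas
every SI search uses each column exactly once.

```python
import random

random.seed(1)  # fixed seed, just for a reproducible illustration

n_chars = 10  # hypothetical number of characters (columns) in the matrix

# One bootstrap pseudoreplicate: sample n_chars column indices
# with replacement from the original n_chars columns.
replicate = [random.randrange(n_chars) for _ in range(n_chars)]

counts = {i: replicate.count(i) for i in range(n_chars)}
omitted = [i for i, c in counts.items() if c == 0]
duplicated = [i for i, c in counts.items() if c > 1]

print("sampled columns:", sorted(replicate))
print("omitted:", omitted)
print("sampled more than once:", duplicated)
```

Each replicate tree search is then run on the resampled matrix rather than on
the original one.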
>>2. SI calculations have a more direct and intuitive relationship to
>>the "robustness" of nodes.
>My intuition works differently, I guess.
>
I will stand by my characterization of SIs as more intuitive than a
bootstrap estimate (compare "the shortest tree that does not include this
node is 3 steps longer than the minimum-length tree" with "this node has a
bootstrap support of 87%"). Admittedly, SI values get harder
to interpret across analyses when differing weighting schemes are
employed, but they are still of value for comparing nodes within
a tree.
>>3. SI calculations do not confound statistical support for the reality
>>of a particular tree with a separate issue of how likely the same tree
>>would be obtained if you had more characters from the same universe
>>of characters, whether or not it is the "true" tree.
>I am not sure how SI achieves this. This seems to be a statement that
>SI is not misled by inconsistency problems the way bootstraps will be. But
>see assertion 5 below.
>
The SI is difficult to confuse with a statistical measure of confidence.
That is all I meant. You will have to admit that there are countless
articles in molecular biology journals where authors have assumed that
the bootstrap is a confidence statement for the reality of a node,
never considering the possibility that other factors besides similarity due
to history might account for the node robustness (i.e., they did not read
your articles on bootstrap calculations).
>>4. Some of the assumptions of a bootstrap analysis are frequently violated,
>>for example, it depends on the i.i.d. assumptions identified by Felsenstein
>>and others, which require that the characters be identically and
>>independently distributed. Sanderson has separated these assumptions into
>>two less restrictive assumptions, namely, that characters are independent,
>>and that the observed character set is a "representative" sample of the
>>"universe of characters" (paraphrased from PAUP 3.1 manual, p. 56).
>I would alter these assertions by saying that the characters are
>independent, and that they are randomly sampled from A UNIVERSE of
>characters rather than THE UNIVERSE. There is no assumption that they
>randomly sample all possible characters (if they did the method would be
>primarily of interest to angels on heads of pins) but only that they
>are drawn randomly from some large set of possible characters. Thus two
>studies, one using osteological characters, and one using behavioral ones,
>could both (separately) use the bootstrap without the difference in the
>universes from which they draw creating a problem. Independence is the
>more problematic assumption, actually.
>>5. Whether or not SI values differ from bootstrap analyses in overcoming
>>systematic biases such as "long branch effects" has not been exhaustively
>>explored, but it might be at least advisable to try both, rather than
>>limit your estimates of node robustness to bootstrap values alone (the
>>most common practice).
>Well, see my comment under 3 above. I also wonder whether advocates of
>SI/decay-indices would be willing to say that it might be advisable to
>try bootstrapping too, rather than limit your estimates of node robustness
>to SI alone?
>
I just suggested trying both, even though they are largely redundant.
Journals such as MBE demand in their instructions to authors that all figured
trees include bootstrap values. Am I alone in thinking that this is a bit
severe when SI values are an acceptable alternative? I may so far have been
the only author to get away with publishing only SI values on my trees in
MBE, but it wasn't easy.
>I also note that SI can be done on many other methods of inferring phylogenies,
>(for example seeing how much increase there is in the sum of squares
>in a distance method when one bans a given branch). Only in the case of
>likelihood does it have a direct connection to statistics.
>I have not put SI/decay into PHYLIP, not because it isn't worth doing but
>just because we have not yet got constraints for/against a given group
>built in yet, for technical and organizational reasons.
>Sorry about the theological warfare here, but the literature on things
>like foundations of inferring phylogenies and criticisms of bootstraps
>is in a funny state right now, with a lot of oral tradition and not
>many clear treatments in journal articles.
>
I agree, but please don't discourage people from trying something
relatively new, as I believe you have by framing this as theological
warfare.
>-----
>Joe Felsenstein joe at genetics.washington.edu (IP No. 128.95.12.41)
> Dept. of Genetics, Univ. of Washington, Box 357360, Seattle, WA 98195-7360
John Huelsenbeck wrote:
>One thing that bothers me about decay indices is what to make of a particular
>value for a clade. Say that a particular clade will "decay" at three steps.
>Is that a large number or a small number? In other words, should I put a lot
>of confidence in clades that decay at three, four, five, ... steps?
>
Why does it bother you to have to relate node robustness back to hypotheses
of character evolution?
>Faith has proposed randomization of the original matrices to determine
>the null distribution of the decay indices (his T-PTP tests; we need to talk
>to him about acronyms...). However, passing the "is it random" test seems
>like a pretty easy test for most data sets to pass.
We don't know much about the SI because it has been too tedious to calculate.
Hopefully, the method suggested above will help in that respect.
Doug
--
Doug Eernisse <DEernisse at fullerton.edu>
Dept. Biological Science MH282
California State University
Fullerton, CA 92634