Question Regarding Bootstrap

Guy Hoelzer hoelzer at unr.edu
Mon Oct 18 10:13:01 EST 1999

In article <7ucsoq$cnf at net.bio.net>, Luis <luismms at mail.telepac.pt> wrote:

> I was wondering if there is a relation between short branches and low
> bootstrap values for the adjacent nodes. It seems to me that the shorter
> the branch, the lower the bootstrap value will be for the node from which
> it comes.

That is generally the case.

> Does this always happen, 

No.  I will describe the reason using the maximum parsimony criterion, but
I believe the same pattern holds true for all tree estimating methods. 
Imagine a most parsimonious tree for many taxa.  One node is subtended by
a branch with 5 changes parsimoniously mapped onto it.  Further imagine
that each of these putative synapomorphies is uncontradicted throughout
the rest of the matrix.  The vast majority of the bootstrap replicates
will contain at least one copy from this set of 5 characters.  Each time
this occurs, the original clade will appear in the most parsimonious tree
for the bootstrapped matrix, thus it obtains a very high bootstrap value.
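
The "vast majority" claim can be checked with a little arithmetic. The sketch below is only an illustration: the function name and the matrix size of 500 characters are assumptions, not anything from the original question. It computes the chance that a bootstrap replicate, drawn with replacement, contains at least one of a given set of characters.

```python
def prob_at_least_one(n_chars: int, n_support: int) -> float:
    """Chance that a bootstrap replicate of n_chars draws (with replacement)
    includes at least one of n_support specific characters."""
    if n_chars <= 0 or not 0 <= n_support <= n_chars:
        raise ValueError("need n_chars > 0 and 0 <= n_support <= n_chars")
    # Probability that a single draw misses every supporting character.
    p_miss_one_draw = 1 - n_support / n_chars
    # All n_chars draws are independent, so the misses multiply.
    return 1 - p_miss_one_draw ** n_chars


# Hypothetical matrix of 500 characters.
five_changes = prob_at_least_one(500, 5)   # branch with 5 uncontradicted changes
one_change = prob_at_least_one(500, 1)     # branch with a single change
```

With 5 uncontradicted characters the clade is represented in roughly 99% of replicates, but with a single character the figure falls to roughly 63%, which is one reason very short branches tend to carry low bootstrap values.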

Now imagine another node somewhere else in the same tree, which is
subtended by a branch with 10 changes parsimoniously mapped onto it. 
However, this time there are numerous other characters supporting data
partitions that are inconsistent with the existence of this clade. 
Depending on how many of the original 10 characters that changed on the
relevant branch in the most parsimonious tree are sampled, and the number
of characters supporting other particular data partitions, the original
clade might not appear very often in the set of bootstrap trees.
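
A toy simulation can make this contrast concrete. It is a deliberate simplification and not a parsimony search: it assumes (hypothetically) that the clade is recovered in a replicate whenever the sampled characters supporting it strictly outnumber the sampled characters supporting a single competing partition. The function name, the rule, and all numbers are illustrative assumptions.

```python
import random


def toy_bootstrap_support(n_chars, n_support, n_conflict, n_reps=1000, seed=0):
    """Fraction of bootstrap replicates in which supporting characters
    outnumber conflicting ones (toy stand-in for 'clade recovered')."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(n_reps):
        support = conflict = 0
        # Resample n_chars columns with replacement.
        for _ in range(n_chars):
            col = rng.randrange(n_chars)
            if col < n_support:
                support += 1      # column supports the clade
            elif col < n_support + n_conflict:
                conflict += 1     # column supports the competing partition
            # all remaining columns are neutral for this clade
        if support > conflict:
            hits += 1
    return hits / n_reps


short_clean = toy_bootstrap_support(200, n_support=5, n_conflict=0)
long_contested = toy_bootstrap_support(200, n_support=10, n_conflict=8)
```

Under these assumptions the 5-change branch with no conflict scores higher than the 10-change branch with 8 conflicting characters, even though the second branch is twice as long.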

> and does the contrary happen for long branches? 

As branches get longer, they first begin to accumulate more and more
synapomorphies, which is phylogenetic signal making tree estimation
easier.  However, at some point they begin to lose too much information
about their ancestral history (i.e. the plesiomorphic states for the
clade).  In addition, characters begin to change multiple times along the
branch, which increases the chance of convergent evolution.  Both of these
factors destabilize the phylogenetic position of the clade during a
bootstrap analysis.

It is also important to remember the issue of long branch attraction.  If
there are two particularly long branches (with lots of convergence on
other taxa) in the TRUE tree, then they tend to cluster together in
phylogenetic analyses.  This is basically because two random sets of
character states are more likely to resemble one another than either is to
resemble any of the non-randomly associated sets of states among the other
taxa.  Depending on the distribution of variation among the other taxa,
the wrongly clustered long branch clade can obtain high bootstrap values.

> If this is true wouldn't bootstrap go against its first
> intention of providing a means for estimating sampling error, since if we
> add invariable positions to sequences the bootstrap values would
> decrease even though we increase the sampling?

First, the bootstrap is used in phylogenetics to provide a means for
estimating sampling error, but this is not what bootstrapping was
developed for.  It is best used as a way to explore the variance structure
of a matrix.  For this reason, it does have a number of undesirable
qualities when used to provide point estimates of parameter values.  

Second, the bootstrap values will not change for a maximum parsimony
analysis when you add invariable characters; although they can change a
little using some other methods, like maximum likelihood.  For maximum
likelihood, the result is not predictable; sometimes the bootstrap values
will increase, sometimes they will decrease.
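
One rough way to see why invariable characters leave parsimony support alone is to look at how many informative characters a replicate is expected to contain. The sketch below is an illustration under stated assumptions (the function name and the counts are hypothetical): a replicate has as many draws as the matrix has columns, and each draw hits an informative column with probability equal to the informative fraction.

```python
from fractions import Fraction


def expected_informative_draws(n_informative: int, n_invariable: int) -> Fraction:
    """Expected number of informative characters in one bootstrap replicate."""
    n_total = n_informative + n_invariable
    if n_total <= 0:
        raise ValueError("matrix must contain at least one character")
    # n_total draws, each informative with probability n_informative / n_total.
    return n_total * Fraction(n_informative, n_total)


without_padding = expected_informative_draws(50, 0)
with_padding = expected_informative_draws(50, 950)
```

Both calls give 50: padding the matrix adds draws but dilutes each one by the same factor. This covers only the average count and says nothing about likelihood methods, where invariable sites do enter the calculation.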

You quickly changed gears in the last sentence, from branch lengths to
sampling effort, so I am not sure exactly what you are asking here. 
However, I will mention that measures of confidence ought to be sensitive
to sample size, because the probability that you can infer the truth is
related to sample size.  You specifically mention the possibility that a
measure of confidence should never decrease with increasing sampling
effort.  This is not strictly true.  When assumptions of the estimation
procedure are violated, then the estimate can be statistically
inconsistent.  That is, the closeness of the estimate to the truth
initially increases with sampling effort, but then it peaks and asymptotes
to zero.  An appropriate measure of confidence would mirror this pattern;
thus, after peaking at some positive sample size, the measure of
confidence would decrease with increasing sample size.

Guy Hoelzer
Department of Biology
University of Nevada Reno
Reno, NV  89557

More information about the Mol-evol mailing list