advice on bootstrapping values

Brian Fristensky frist at cc.UManitoba.CA
Fri Mar 20 20:35:28 EST 1998

Mike Tennant wrote:

>     I have a protein multiple sequence alignment of some very divergent
> sequences (245 sequences altogether), generated using PSI-BLAST from
> NCBI. This alignment contains quite a lot of gap regions (I'd estimate
> that well over 50% of the positions contain gap characters), but
> the core regions are all aligned in a sensible manner. I've bootstrapped
> the resulting tree (1000 samples) in clustalw, and seen that the values
> on the nodes in the sub-trees (those sequences which are easily related)
> are relatively high (80+ %), but the values between sub-trees can be
> very low (as low as 2%).
>     I'd appreciate it if anybody could comment on these values,
> especially as to whether they are meaningless or not. Any other advice
> on tree creation when one's MSA contains lots of gaps would also be
> welcome.

I've been puzzling over this one for a while and welcome the 
opportunity for discussion. For an infinite number of polymorphic
sites, the percentage of trees in which a group of 
sequences cluster together should be a direct estimator of
the confidence limit for that group. To quote Joe Felsenstein
(Evolution 39:783-791, 1985) "if a group shows up 95% of the
time or more, the evidence for it is taken to be statistically
significant" (at the 5% level - B.F.) 
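
As a minimal sketch of how such a percentage is tallied, assume
each replicate tree has already been reduced to the set of groups
it contains. The taxon names and the three toy replicates below
are made up for illustration, and clade_support is a hypothetical
helper, not a function from clustalw or any other package.

```python
from collections import Counter

def clade_support(replicate_trees):
    """Return the percentage of replicate trees that contain each clade.

    Each tree is an iterable of clades; each clade is an iterable of
    taxon names (a hypothetical representation chosen for this sketch).
    """
    counts = Counter()
    for tree in replicate_trees:
        # Build a set first so a clade is counted at most once per tree.
        for clade in {frozenset(c) for c in tree}:
            counts[clade] += 1
    n = len(replicate_trees)
    return {clade: 100.0 * k / n for clade, k in counts.items()}

# Toy data: three replicate trees over taxa A, B, C and D.
replicates = [
    [("A", "B"), ("C", "D")],
    [("A", "B"), ("C", "D")],
    [("A", "C"), ("B", "D")],
]
support = clade_support(replicates)
print(support[frozenset({"A", "B"})])
```

Here the group (A,B) appears in two of the three toy replicates,
so its bootstrap value is two thirds, expressed as a percentage.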

Herein lies the problem with bootstrapping. I'm not saying there's
anything wrong with doing it. The problem is how you make use
of the bootstrap results. On the surface, one might be tempted
to dismiss any tree or clade that didn't have
a bootstrap value of 95% or greater as meaningless.
There are several reasons why this is a misuse of bootstrapping.

1) Because bootstrap resampling draws N sites with replacement,
any given replicate contains fewer than N distinct sites (on
average about 63% of them when N is large). NO tree based on a
replicate will be constructed using as much information as a
tree that uses all sites. In other words, no replicate tree can
be as good as the tree made using all the data.
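
To put a rough number on this, here is a small sketch that
assumes the usual bootstrap scheme (N draws, uniform, with
replacement) and an alignment length of 300 sites chosen only for
illustration. It compares the exact expected fraction of distinct
sites, 1 - (1 - 1/N)^N, with a simulation. The function names and
the seed are my own, not taken from any package.

```python
import random

def distinct_fraction(n_sites, rng):
    """Draw one bootstrap replicate; return the fraction of distinct sites."""
    sample = [rng.randrange(n_sites) for _ in range(n_sites)]
    return len(set(sample)) / n_sites

def expected_distinct_fraction(n_sites):
    """Exact expectation under uniform sampling with replacement."""
    return 1.0 - (1.0 - 1.0 / n_sites) ** n_sites

rng = random.Random(1998)   # arbitrary fixed seed, for repeatability
n_sites = 300               # assumed alignment length
n_replicates = 1000

expected = expected_distinct_fraction(n_sites)
simulated = sum(distinct_fraction(n_sites, rng)
                for _ in range(n_replicates)) / n_replicates
print(round(expected, 3), round(simulated, 3))
```

Both figures come out close to 1 - 1/e, i.e. a bit under two
thirds of the original sites appear in a typical replicate.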

2) The number of polymorphic characters is not infinite. For 
example, if you have a protein that is 300 amino acids long,
and your set of sequences is polymorphic at only 2/3 of the
sites, you have only 200 polymorphic sites. Because resampling
draws sites uniformly at random with replacement, in any given
replicate some sites will be represented more than once, and
some sites will not be represented at all. Each individual tree
is biased towards some subset of sites. This should all average
out if you do enough bootstrap replicates, since each tree is
biased towards a different subset.

Having all trees biased is not a bad thing. In fact, the
resampling tries to simulate what would happen if we could keep
going back to our population and getting fresh data. For very
large datasets (i.e. long sequences) we should always get about
the same answer.

Small datasets (e.g. short sequences, small numbers of RFLP 
or RAPD markers) are particularly sensitive to sampling.
This is because as you get to the terminal branches of a 
tree, the choice of where to put a sequence depends on 
a very small number of polymorphisms. Imagine proteins A
and B that differ by only 3 amino acid substitutions. Some
bootstrapped samples would pick up only 2, or 1, or none of the
polymorphic sites. Thus, in some of the trees, the distance
between those two sequences would be 0. If there were another 
protein C, which differed from A by only 1 amino acid substitution,
then in replicates where the A-B distance is 0, B would be 
placed closer to A on the tree than would C, provided that
the 1 polymorphic site between A and C was included in the
replicate. So small datasets are always going to have
lower bootstrap values. Some sequences might 
cluster close together not because they are closely related,
but because the data set we happened to get makes them
look closely related. In this way bootstrapping tells us
that a small dataset is inherently less reliable than a
large one.
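
The A, B, C example can be made concrete with a short
calculation. This is only a sketch under assumptions of my own:
an alignment of 300 sites, uniform resampling with replacement,
and the single A-C difference falling at a site other than the
three A-B differences. The helper name prob_miss_all is made up.

```python
def prob_miss_all(n_sites, k):
    """Probability that a bootstrap replicate (n_sites uniform draws,
    with replacement) contains none of k specified sites."""
    return ((n_sites - k) / n_sites) ** n_sites

n_sites = 300   # assumed alignment length
ab_sites = 3    # sites that differ between A and B
ac_sites = 1    # site that differs between A and C (assumed distinct)

# Replicates in which the A-B distance collapses to 0.
p_ab_zero = prob_miss_all(n_sites, ab_sites)

# Replicates in which A-B is 0 AND the A-C site was sampled at least
# once: P(miss the A-B sites) minus P(miss the A-B and A-C sites).
p_b_closer = p_ab_zero - prob_miss_all(n_sites, ab_sites + ac_sites)

print(f"{p_ab_zero:.3f} {p_b_closer:.3f}")
```

Under these assumptions roughly 5% of replicates lose all three
A-B differences, and in roughly 3% of replicates B would look
closer to A than C does.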

3) On the other hand, one has to wonder whether resampling
estimates for small datasets really mean the same thing 
as estimates based on large datasets. In larger datasets,
each bootstrap replicate is likely to be unique.
For small datasets, there's less data to sample, so 
you keep resampling the same sites over and over. The
bootstrap estimate carries with it the assumption that
each replicate is independent of other replicates. That
probably isn't true for small datasets.

My conclusions:
a) Bootstrap values appear to be conservative estimates
of the reproducibility of the data.
Perhaps the most important question is, HOW conservative?
b) The best tree is the tree that uses ALL the sites.
c) Perhaps the best way to use bootstrap values is
as a means of comparing how strongly different groupings
are supported relative to one another.

I know you're out there Joe. Have I got it right?
With more people using bootstrapping, and probably
reading too much into it, perhaps these points should
be brought up in a letter to the editor, somewhere.

Brian Fristensky                | 
Department of Plant Science     |  "... the lingering after-winter, 
University of Manitoba          |   the gray season, March and April,
Winnipeg, MB R3T 2N2  CANADA    |  the months God created to show 
frist at           |  people who don't drink what a
Office phone:   204-474-6085    |  hangover is like."
FAX:            204-474-7528    |            - Garrison Keillor          WOBEGON BOY
