informative sites & bootstrap

Brian Foley btf at t10.lanl.gov
Wed May 6 18:40:25 EST 1998


Warren Gallin wrote:
> ...
>     If you then include the uninformative sites in the 
> bootstrap, you are simply diluting out the signal.  

	The ratio of informative sites to non-informative
sites does provide some information, though, which can
be considered in some evolutionary distance estimates.
If I have 10,000 bases of sequence for 6 organisms and
they are closely related so that only 100 of the sites
are informative, I am pretty sure that each site is
truly informative and not mis-informative (saturated
with mutations).   On the other hand if I have 100
bases of DNA sequence for 6 organisms and all 100 have
changed in at least one of the 6, I would bet that many
of the sites had been mutated many times back and forth
between all 4 possible bases.  
	The most parsimonious tree may or may not be the 
true evolutionary tree.  
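
	To put a rough number on the saturation argument above, here
is a minimal sketch using the Jukes-Cantor (1969) correction.  This
is my own illustration, not a method from the thread: it applies to
a single pair of sequences and assumes equal base frequencies and
equal rates for all substitutions, which real data generally violate.
The function name jc69_distance is made up for the example.

```python
import math


def jc69_distance(p_observed):
    """Jukes-Cantor estimate of substitutions per site.

    p_observed is the fraction of sites that differ between two
    aligned sequences. The estimate exceeds p_observed because some
    sites have changed more than once (multiple hits).
    """
    if not 0.0 <= p_observed < 0.75:
        # At 0.75 the two sequences look like random noise to this model.
        raise ValueError("p_observed must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p_observed)


if __name__ == "__main__":
    for p in (0.01, 0.30, 0.60):
        d = jc69_distance(p)
        # "hidden" = estimated changes that left no visible difference
        print(f"observed {p:.2f}  estimated {d:.3f}  hidden {d - p:.3f}")
```

	At 1% observed difference the correction is negligible, while
at 60% the estimate is roughly 1.2 substitutions per site, i.e. about
half of the changes are invisible in the alignment.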

> The 
> result is that you get two confounding factors in the 
> resulting bootstrap tree 1) the thing that you want, an
> estimate of how well the tree topology is supported by 
> the informative sites and 2) a thing that you do not want, 
> a variable number of informative sites in each bootstrap 
> replicate.

	If you eliminate non-informative sites, you still
have no way of knowing how many sites are mis-informative.
So you still get a variable number of sites that are
truly informative.
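
	The "variable number of informative sites in each bootstrap
replicate" that Warren describes can be seen with a small sketch.
The 6-sequence toy alignment and the function names are invented for
illustration, and "informative" here means parsimony-informative in
the usual sense (at least two states, each in at least two sequences).
Nothing in it can tell a truly informative site from a saturated one.

```python
import random
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two states each occur in at least two sequences."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def bootstrap_informative_counts(alignment, n_replicates=100, seed=0):
    """Resample columns with replacement; count informative sites per replicate."""
    rng = random.Random(seed)
    columns = list(zip(*alignment))  # one tuple of bases per site
    n_sites = len(columns)
    counts = []
    for _ in range(n_replicates):
        replicate = [columns[rng.randrange(n_sites)] for _ in range(n_sites)]
        counts.append(sum(is_parsimony_informative(c) for c in replicate))
    return counts


# Hypothetical alignment: 12 sites, 4 of them parsimony-informative.
toy_alignment = [
    "AACCGTACGTAT",
    "AACCGTACGCAT",
    "AACTGTGCGTAT",
    "AGCTGTGCGCAT",
    "AGCCGTACGTAT",
    "AGCCATACGCAG",
]

if __name__ == "__main__":
    counts = bootstrap_informative_counts(toy_alignment, n_replicates=20, seed=1)
    print("informative sites per replicate:", counts)
```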

	HIV DNA and protein sequence data are a good testing
ground for this type of work, because we have some
24,000 sequences from closely related viruses (plus a few
distant SIVs and other lentiviruses) and we have dates and
known evolutionary history for some of them.  For example we
have several lab workers and a few chimpanzees all infected
with the same molecular clone of lab-grown HIV (the lab
workers were exposed accidentally, the chimps were injected
deliberately) and then sampled over time.  
	Another alternative is generating sequences with
a computer and a program which can model evolution, and
then taking sequences with a known history and trying to
reconstruct the true tree.  The disadvantage is that no
program yet can take into account all the strange things that
actually happen in biology.  We know HIV has a propensity toward
G -> A transitions and its genome is A rich and C poor:

Empirical Base Frequencies:

   A       0.39856
   C       0.16709
   G       0.18683
  T(U)     0.24752

but we don't know exactly why.  And we do know that in
regions of the genome where RNA secondary structures are
important, the G+C content increases as expected.
So simulated data assume a certain model of evolution, and
that exact same model is available to construct the tree.
In real life the model is hidden to some degree.
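
	As a rough sketch of the simulation approach, the following
evolves sequences down a known four-tip tree using the base
frequencies quoted above.  Everything else is an assumption made for
illustration: the tree shape ((A,B),(C,D)), the per-branch event
probability, and the F81-style rule that a site hit by an event is
redrawn from the base frequencies.  It does not model the G -> A bias
or secondary-structure constraints, which is exactly the limitation
described above.

```python
import random

# Empirical HIV base frequencies quoted in this post.
HIV_BASE_FREQS = {"A": 0.39856, "C": 0.16709, "G": 0.18683, "T": 0.24752}


def random_sequence(length, freqs, rng):
    """Draw a random sequence with the given base frequencies."""
    bases = list(freqs)
    weights = [freqs[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


def evolve(sequence, p_event, freqs, rng):
    """Evolve one branch: each site is redrawn from freqs with prob. p_event.

    A redraw may pick the same base again, so the fraction of visibly
    changed sites is lower than p_event.
    """
    bases = list(freqs)
    weights = [freqs[b] for b in bases]
    return "".join(
        rng.choices(bases, weights=weights)[0] if rng.random() < p_event else base
        for base in sequence
    )


def simulate_known_tree(length=200, p_event=0.05, seed=0):
    """Simulate tips of the hypothetical tree ((A,B),(C,D))."""
    rng = random.Random(seed)
    root = random_sequence(length, HIV_BASE_FREQS, rng)
    left = evolve(root, p_event, HIV_BASE_FREQS, rng)   # ancestor of A, B
    right = evolve(root, p_event, HIV_BASE_FREQS, rng)  # ancestor of C, D
    return {
        "A": evolve(left, p_event, HIV_BASE_FREQS, rng),
        "B": evolve(left, p_event, HIV_BASE_FREQS, rng),
        "C": evolve(right, p_event, HIV_BASE_FREQS, rng),
        "D": evolve(right, p_event, HIV_BASE_FREQS, rng),
    }


if __name__ == "__main__":
    for name, seq in simulate_known_tree(length=60, seed=42).items():
        print(name, seq)
```

	The tips can then be handed to any tree-building program to
see whether it recovers ((A,B),(C,D)).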

-- 
 ____________________________________________________________________
|Brian T. Foley               btf at t10.lanl.gov                       |
|HIV Database                 (505) 665-1970                         |
|Los Alamos National Lab      http://hiv-web.lanl.gov/index.html     |
|Los Alamos, NM 87544  U.S.A. http://www.t10.lanl.gov/~btf/home.html |
|____________________________________________________________________|



