informative sites & bootstrap

Warren Gallin wgallin at gpu.srv.ualberta.ca
Thu May 7 12:19:25 EST 1998


In Article <6iqsd9$c9f at net.bio.net>, Brian Foley <btf at t10.lanl.gov> wrote:
>Warren Gallin wrote:
>> ...
>>     If you then include the uninformative sites in the 
>> bootstrap, you are simply diluting out the signal.  
>
>        The ratio of informative sites to non-informative
>sites does provide some information though, that can
>be considered in some evolutionary distance estimates.
>If I have 10,000 bases of sequence for 6 organisms and
>they are closely related so that only 100 of the sites
>are informative, I am pretty sure that each site is
>truly informative and not mis-informative (saturated
>with mutations).   On the other hand if I have 100
>bases of DNA sequence for 6 organisms and all 100 have
>changed in at least one of the 6, I would bet that many
>of the sites had been mutated many times back and forth
>between all 4 possible bases.  
>        The most parsimonious tree may or may not be the 
>true evolutionary tree.  

Let me back up a bit here.  As far as I know [correct me if I am wrong] the
concept of an uninformative site applies only to some MP analyses.  So
although I agree that the most parsimonious tree is not necessarily the best
reconstruction of evolution, this discussion is only about MP analysis.
I see the point of your discussion above, but it is not relevant to an MP
approach; it is appropriate to ML and distance approaches.
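
For readers who want the MP notion of "informative" spelled out, here is a
minimal sketch.  It assumes the usual parsimony criterion (a site is
informative when at least two character states each occur in at least two
taxa) and treats gaps and "?" as missing data.  The function names and the
toy alignment are made up for illustration and do not come from any
particular package.

```python
from collections import Counter


def is_parsimony_informative(column, missing="-?"):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(c for c in column.upper() if c not in missing)
    # Count the states that are shared by two or more taxa.
    shared_states = sum(1 for n in counts.values() if n >= 2)
    return shared_states >= 2


def informative_site_indices(alignment):
    """alignment: list of equal-length sequences, one string per taxon."""
    return [
        i
        for i, column in enumerate(zip(*alignment))
        if is_parsimony_informative("".join(column))
    ]


# Toy alignment (invented data): 4 taxa, 5 sites.
toy = [
    "AAGCT",
    "AAGTT",
    "ACACT",
    "ACATA",
]

# Site 0 is constant and site 4 has a single variant (a singleton),
# so neither is informative under parsimony.
print(informative_site_indices(toy))  # prints [1, 2, 3]
```

Constant sites and singleton sites are the ones referred to as
uninformative in this thread; they cost the same number of steps on every
topology, so they cannot favour one tree over another under MP.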

    There is a difference between uninformative and misinformative.  My
point is that the bootstrap is a way of dealing with the misinformative
sites, but that uninformative sites will weaken the bootstrap.

>> The 
>> result is that you get two confounding factors in the 
>> resulting bootstrap tree 1) the thing that you want, an
>> estimate of how well the tree topology is supported by 
>> the informative sites and 2) a thing that you do not want, 
>> a variable number of informative sites in each bootstrap 
>> replicate.
>
>        If you eliminate non-informative sites, you still
>have no way of knowing how many sites are mis-informative.
>So you still get a variable number of sites that are
>truthfully informative.

Once again, I don't see the relevance of this point to an MP analysis.  The
number of uninformative sites in an MP dataset has no relevance to the method
of tree searching.  It looks to me like you are mixing criteria for two
different methods of tree reconstruction.
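
The "variable number of informative sites in each bootstrap replicate" can
be illustrated with a rough sketch.  It borrows the 10,000-site,
100-informative-site figures from the example quoted earlier and assumes,
purely for convenience, that the informative sites are the first 100
columns.  It only counts informative columns per pseudo-replicate and builds
no trees; the function name and parameters are hypothetical.

```python
import random


def informative_counts(n_sites, n_informative, n_replicates=100, seed=0):
    """Count informative columns drawn in each bootstrap pseudo-replicate.

    Columns 0 .. n_informative-1 are treated as the informative ones
    (an assumption made only to keep the sketch short).
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_replicates):
        # A bootstrap replicate resamples n_sites columns with replacement.
        drawn = sum(
            1 for _ in range(n_sites) if rng.randrange(n_sites) < n_informative
        )
        counts.append(drawn)
    return counts


# Bootstrapping the full alignment: the count differs between replicates.
full = informative_counts(n_sites=10_000, n_informative=100)
print("all sites:        min", min(full), "max", max(full))

# Bootstrapping only the informative columns: every drawn column is
# informative, so the count is fixed at 100 in every replicate.
reduced = informative_counts(n_sites=100, n_informative=100)
print("informative only: min", min(reduced), "max", max(reduced))
```

With the full alignment the number of informative columns per replicate
scatters around 100, while with the reduced alignment it is always exactly
100.  This says nothing about which sites are misinformative; it only shows
the second confounding factor described above.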

>        HIV DNA and protein sequence data is a good testing
>ground for some of this type of work, because we have some
>24,000 sequences from closely related (and a few distant
>SIVs and other lentiviruses) and we have dates and known 
>evolutionary history on some of them.  For example we have
>several lab workers and a few chimpanzees all infected with
>the same molecular clone of lab-grown HIV (the lab workers
>were accidental exposure, the chimps were injected 
>deliberately) and then sampled over time.  

[Suggestion about simulation and example of base composition bias deleted]

Once again, I think this is a very important issue to pursue, and I look
forward to reading the results, but this is not relevant to the issue of the
thread.  None of these factors is relevant to an MP analysis, and there are
no uninformative sites in ML and distance methods.  Your discussion of
modelling suggests to me that you are using the dataset to do ML analyses. 
That is a great dataset for testing ML methods.  It just seems to me that
you are confusing two different methodologies.  If I am missing the point, I
look forward to a more explicit discussion.

Regards,



Warren Gallin
Department of Biological Sciences
University of Alberta
Edmonton,  Alberta     T6G 2E9
Canada
wgallin at gpu.srv.ualberta.ca



