informative sites & bootstrap
wgallin at gpu.srv.ualberta.ca
Thu May 7 12:19:25 EST 1998
In Article <6iqsd9$c9f at net.bio.net>, Brian Foley <btf at t10.lanl.gov> wrote:
>Warren Gallin wrote:
>> If you then include the uninformative sites in the
>> bootstrap, you are simply diluting out the signal.
> The ratio of informative sites to non-informative
>sites does provide some information though, that can
>be considered in some evolutionary distance estimates.
>If I have 10,000 bases of sequence for 6 organisms and
>they are closely related so that only 100 of the sites
>are informative, I am pretty sure that each site is
>truly informative and not mis-informative (saturated
>with mutations). On the other hand if I have 100
>bases of DNA sequence for 6 organisms and all 100 have
>changed in at least one of the 6, I would bet that many
>of the sites had been mutated many times back and forth
>between all 4 possible bases.
> The most parsimonious tree may or may not be the
>true evolutionary tree.
Let me back up a bit here. As far as I know [correct me if I am wrong], the
concept of an uninformative site applies only to some MP analyses, so although
I agree that the most parsimonious tree is not necessarily the best
reconstruction of evolution, this discussion is only about MP analysis.
Although I see the point of your discussion above, it is not relevant to an
MP approach; it is appropriate to ML and distance approaches.
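For readers who want the term pinned down, here is a minimal sketch assuming the usual parsimony definition: a site is informative when at least two character states each occur in at least two taxa. The function names and the toy alignment are made up for illustration, and gaps and ambiguity codes are not handled.

```python
from collections import Counter

def is_parsimony_informative(column):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2

def informative_sites(alignment):
    """Return 0-based indices of parsimony-informative columns."""
    columns = zip(*alignment)  # transpose rows (taxa) into columns (sites)
    return [i for i, col in enumerate(columns) if is_parsimony_informative(col)]

# Hypothetical alignment: four taxa, five sites.
alignment = [
    "AAGTC",
    "AAGTC",
    "ACGCC",
    "ACTCA",
]
print(informative_sites(alignment))  # [1, 3]
```

Site 0 is constant and sites 2 and 4 carry a single change in one taxon, so only sites 1 and 3 count as informative under this definition.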
There is a difference between uninformative and misinformative. My
point is that the bootstrap is a way of dealing with the misinformative
sites, but that uninformative sites will weaken the bootstrap.
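One way to see how uninformative sites could affect the bootstrap is to count the informative columns drawn in each replicate. The sketch below assumes a plain nonparametric bootstrap (columns resampled with replacement) and a made-up dataset of 10 informative and 90 constant columns; the function names and the seed are illustrative only.

```python
import random
from collections import Counter

def is_informative(column):
    """Parsimony-informative: two or more states, each in two or more taxa."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2

def informative_counts(columns, replicates, seed=0):
    """Number of informative columns in each bootstrap replicate."""
    rng = random.Random(seed)
    n = len(columns)
    counts = []
    for _ in range(replicates):
        # Resample n columns with replacement, as in a standard bootstrap.
        sample = [columns[rng.randrange(n)] for _ in range(n)]
        counts.append(sum(1 for col in sample if is_informative(col)))
    return counts

# Hypothetical data: 10 informative columns and 90 constant columns.
columns = ["AACC"] * 10 + ["AAAA"] * 90

all_sites = informative_counts(columns, 200)
only_informative = informative_counts(
    [col for col in columns if is_informative(col)], 200
)

print(min(all_sites), max(all_sites))  # varies from replicate to replicate
print(set(only_informative))           # always 10 by construction
```

With the constant columns left in, the informative count fluctuates around 10 from one replicate to the next. With them removed, every replicate contains exactly 10 informative columns and only their mix changes.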
>> result is that you get two confounding factors in the
>> resulting bootstrap tree 1) the thing that you want, an
>> estimate of how well the tree topology is supported by
>> the informative sites and 2) a thing that you do not want,
>> a variable number of informative sites in each bootstrap
> If you eliminate non-informative sites, you still
>have no way of knowing how many sites are mis-informative.
>So you still get a variable number of sites that are
Once again, I don't see the relevance of this point to an MP analysis. The
number of uninformative sites in an MP dataset has no relevance to the method
of tree searching. It looks to me like you are mixing criteria for two
different methods of tree reconstruction.
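The claim about tree searching can be checked on a small case. This is a rough sketch assuming Fitch parsimony with unordered, equally weighted states and four taxa, for which there are only three unrooted topologies. The helper names and example columns are hypothetical.

```python
# The three unrooted topologies for taxa 0..3, written as the two
# pairs of taxa separated by the internal branch.
TOPOLOGIES = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]

def fitch_merge(x, y):
    """Fitch step: intersect if possible (cost 0), otherwise union (cost 1)."""
    common = x & y
    return (common, 0) if common else (x | y, 1)

def column_length(column, topology):
    """Minimum number of changes for one column on one quartet topology."""
    (i, j), (k, l) = topology
    left, c1 = fitch_merge({column[i]}, {column[j]})
    right, c2 = fitch_merge({column[k]}, {column[l]})
    _, c3 = fitch_merge(left, right)
    return c1 + c2 + c3

def tree_lengths(column):
    return [column_length(column, t) for t in TOPOLOGIES]

print(tree_lengths("AACC"))  # [1, 2, 2]  informative: favours ((0,1),(2,3))
print(tree_lengths("AAAC"))  # [1, 1, 1]  single change in one taxon
print(tree_lengths("AAAA"))  # [0, 0, 0]  constant
print(tree_lengths("ACGT"))  # [3, 3, 3]  every taxon different
```

Only the first column distinguishes among the topologies. The others add the same number of steps to every tree, so in this toy case they cannot change which tree is shortest.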
> HIV DNA and protein sequence data is a good testing
>ground for some of this type of work, because we have some
>24,000 sequences from closely related (and a few distant
>SIVs and other lentiviruses) and we have dates and known
>evolutionary history on some of them. For example we have
>several lab workers and a few chimpanzees all infected with
>the same molecular clone of lab-grown HIV (the lab workers
>were accidental exposure, the chimps were injected
>deliberately) and then sampled over time.
[Suggestion about simulation and example of base composition bias deleted]
Once again, I think this is a very important issue to pursue, and I look
forward to reading the results, but this is not relevant to the issue of the
thread. None of these factors is relevant to an MP analysis, and there are
no uninformative sites in ML and distance methods. Your discussion of
modelling suggests to me that you are using the dataset to do ML analyses.
That is a great dataset for testing ML methods. It just seems to me that
you are confusing two different methodologies. If I am missing the point, I
look forward to a more explicit discussion.
Department of Biological Sciences
University of Alberta
Edmonton, Alberta T6G 2E9
wgallin at gpu.srv.ualberta.ca