Parametric bootstrapping

John Huelsenbeck johnh at brahms.biology.rochester.edu
Mon May 1 18:24:45 EST 2000


aswaniv at naos.si.edu

Dear Vijay:

I saw your post on bionet.molbio.evolution. I think I can answer your
question.

You calculate the log likelihood ratio test statistic for your original
data. The test statistic is D = lnL0 - lnL1, where lnL0 is the log
likelihood under the constraint that the subgroup of interest is
monophyletic and lnL1 is the likelihood without the constraint.

You want to generate the probability (null) distribution of D; this
distribution is generated under the assumption that the subgroup
really is monophyletic. How do you generate the null distibution?

1. Find the maximum likelihood tree, branch lengths, and substitution
    parameters under the null model (i.e., that the group is monophyletic).

2. Simulate 100 (or more) data sets using the tree from step #1. You
    are assured that the data were generated under the null model (i.e., you 
    know for a fact that each simulated data set really did have the subgroup
    of interest monophyletic).

3. For each simulated data set, calculate lnL0 and lnL1. Find D for each
    simulated data set. 

4. You will then have 100 (or more) D's. The frequency histogram of the
    D's calculated from the simulation represent an approximation of
    the null distribution.

5. If your observed D is greater than 95% of the simulated D's, then reject
    the null hypothesis (that the group is monophyletic) at the 5% level.

You should be able to use SeqGen for step 2. I also have a program (available
at http://brahms.biology.rochester.edu) that will simulate data under the
HKY85+Gamma model of DNA substitution.

The most time consuming part of the procedure is step # 3. You can automate
this step as follows: Inbed all 100 simulated data sets into a single file.
Your file will look something like this:

#NEXUS

begin data;
   dimensions ntax=x nchar=y;
   format datatype=dna;
   matrix
   Taxon 1 AACGT...
   Taxon 2 AACGG...
   etc...
   ;
end;

begin paup;
   [PAUP COMMANDS HERE]
end;

begin data;
   dimensions ntax=x nchar=y;
   format datatype=dna;
   matrix
   Taxon 1 AACGT...
   Taxon 2 AACGG...
   etc...
   ;
end;

begin paup;
   [PAUP COMMANDS HERE]
end;

begin data;
   dimensions ntax=x nchar=y;
   format datatype=dna;
   matrix
   Taxon 1 AACGT...
   Taxon 2 AACGG...
   etc...
   ;
end;

begin paup;
   [PAUP COMMANDS HERE]
end;

begin data;
   dimensions ntax=x nchar=y;
   format datatype=dna;
   matrix
   Taxon 1 AACGT...
   Taxon 2 AACGG...
   etc...
   ;
end;

begin paup;
   [PAUP COMMANDS HERE]
end;

begin data;
   dimensions ntax=x nchar=y;
   format datatype=dna;
   matrix
   Taxon 1 AACGT...
   Taxon 2 AACGG...
   etc...
   ;
end;

begin paup;
   [PAUP COMMANDS HERE]
end;

begin data;
   dimensions ntax=x nchar=y;
   format datatype=dna;
   matrix
   Taxon 1 AACGT...
   Taxon 2 AACGG...
   etc...
   ;
end;

begin paup;
   [PAUP COMMANDS HERE]
end;

(This is obviously meant to work with PAUP*).
Search and replace all of the "[PAUP COMMANDS HERE]" with

   set autoclose=yes warnreset=no increase=auto;
   constraints my_constraint = ((1,2,3[whatever));
   set criterion=parsimony;
   hsearch enforce=yes;
   set criterion=likelihood;
   lset nst=6 rmatrix=est basefreq=est rates=gamma shape=est pinvar=est;
   lscores 1;
   lset rmatrix=prev basefreq=prev shape=prev pinvar=prev;
   hsearch start=1 enforce=yes;
   lset nst=6 rmatrix=est basefreq=est rates=gamma shape=est pinvar=est;
   lscores 1;
   lset rmatrix=prev basefreq=prev shape=prev pinvar=prev;
   hsearch start=1 enforce=yes;
   
   set criterion=parsimony;
   hsearch enforce=no;
   set criterion=likelihood;
   lset nst=6 rmatrix=est basefreq=est rates=gamma shape=est pinvar=est;
   lscores 1;
   lset rmatrix=prev basefreq=prev shape=prev pinvar=prev;
   hsearch start=1 enforce=no;
   lset nst=6 rmatrix=est basefreq=est rates=gamma shape=est pinvar=est;
   lscores 1;
   lset rmatrix=prev basefreq=prev shape=prev pinvar=prev;
   hsearch start=1 enforce=no;

This will perform two quick (quick for maximum likelihood, that is)
searches, one
under the constraint that the group is monophyletic and the other unconstrained.
Make certain that on the first line of the the first paup block encountered, 
you put in the line

   log start file=myfile;

and that the last line of the last paup block has 

   log stop;

That way, all of the results of the analysis will be output to a log file
which you
can read at your leisure. 

I don't have PAUP* on my computer at home, so the above paup block was from
memory. Make a small test file to check it.

On another note, you may also want to try the program BAMBE. BAMBE is written
by Don Simon and Bret Larget (I don't know the URL, but you can reach Larget's
page through the links page at my web site). The program performs Bayesian
estimation of phylogeny. The program returns the posterior probabilities of
trees (the posterior probability of a tree is the probability of the tree
conditioned
on the observed DNA sequences). The probability that the subgroup of interest
is monophyletic is simply the sum of the posterior probabilities of trees having
this group. This analysis will take a small fraction of the time that the LRT
will take.

Good Luck.

John Huelsenbeck







In article <8ek5ui$av5$1 at mercury.hgmp.mrc.ac.uk>, "Vijay Aswani, Ph.D."
<vaswani at sinfo.net> wrote:

>Hi everyone,
>
>I have a dataset of 12 taxa and ~4,000-odd bases that I am trying to
>analyze. Specifically, I am trying to test the monophyly of a sub-group of
>taxa on the tree. ModelTest selects GTR+I+G as the best model for my data
>under Maximum-Likelihood. I want to do the Log-Likelihood Ratio test on this
>hypothesis (monophyly of a sub-group within the tree). To test the
>significance I thought I would do parametric bootstrapping. Can anyone
>please show me how I go about this, step by step?
>
>I am thinking I would go about it as follows:
>
>1. Generate simulated data sets using Seq-Gen (how many are appropriate?) Do
>I generate two groups of datasets - one with the best tree and the other
>with the best tree with the sub-group constrained as monophyletic?)
>2. I guess I would then have to compute the likelihood scores for each of
>the simulated datasets. Assuming I did a 100, is there any automated way of
>doing this? Or do I have to open each dataset in PAUP*, load the appropriate
>tree, calculate likelihood scores and append them to a file?
>3. What do I do next? Do I make a matrix doing substractions of every
>combination of likelihood values from the best tree set with those from the
>constrained tree set? Is there any program to do this? Is this what
>generates the distribution of delta values to which I would compare the
>delta obtained from the real data set?
>4. Am I on the right track?!
>
>Has anyone out there done parametric bootstrapping to test monophyly by the
>method of Huelsenbeck et al (1996, 1997)? Can you share the procedural
>details of how you did this?
>
>Thanks in anticipation, to all who reply.
>
>Sincerely,
>
>Vijay
>
>/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
>Vijay Aswani, Ph.D.
>Smithsonian Tropical Research Institute,
>Unit 0948,
>APO AA 34002-0948
>U.S.A.
>Tel (in Panama): 507-212-8824  (work), 236-3243 (home)
>Fax: 507-228-0516
>Email: vaswani at sinfo.net or aswaniv at naos.si.edu
>Homepage: http://aswani.freehosting.net
>/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
>
>
>---







More information about the Mol-evol mailing list