Bootstrapping a neighbor joining tree w/Phylip?

Frank Wright mbfw at s-crim1.dl.ac.uk
Thu Jun 24 08:21:27 EST 1993


Robert Rumpf asked about bootstrapping phylogenetic trees using
the PHYLIP package....

The PHYLIP package allows you to bootstrap trees constructed using
any of the tree construction methods (including Fitch-Margoliash,
Neighbor-Joining, Parsimony).  The basic idea is that you create
many resampled ("shuffled") datasets by running SEQBOOT, then run the
tree construction method of your choice on all of them, and then run
the CONSENSE program on the resulting trees to get the final
bootstrapped (consensus) tree.

e.g. to bootstrap a Neighbor-Joining tree derived from nucleic acid
     sequences, run

     SEQBOOT, then DNADIST and NEIGHBOR, then CONSENSE

e.g. to bootstrap a Parsimony tree derived from nucleic acid sequences,
     run

     SEQBOOT, then DNAPARS, then CONSENSE

e.g. to bootstrap a Neighbor-joining tree derived from protein sequences,
     run

     SEQBOOT, then PROTDIST and NEIGHBOR, then CONSENSE

e.g. to bootstrap a Parsimony tree derived from protein sequences, run

     SEQBOOT, then PROTPARS, then CONSENSE
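
The four recipes above differ only in the middle step.  As a minimal
sketch (the function name phylip_pipeline and its argument keywords are
made up for illustration and are not part of PHYLIP), the mapping can be
written as a Bourne-shell function.  It only prints the program sequence;
it does not run anything.

```shell
#!/bin/sh
# Hypothetical helper: print the PHYLIP programs to run, in order,
# for a given data type (dna|protein) and method (nj|parsimony).
phylip_pipeline() {
    case "$1-$2" in
        dna-nj)            echo "seqboot dnadist neighbor consense" ;;
        dna-parsimony)     echo "seqboot dnapars consense" ;;
        protein-nj)        echo "seqboot protdist neighbor consense" ;;
        protein-parsimony) echo "seqboot protpars consense" ;;
        *) echo "usage: phylip_pipeline dna|protein nj|parsimony" >&2
           return 1 ;;
    esac
}

# Example: the protein Neighbor-Joining case.
phylip_pipeline protein nj
```

Any other combination of arguments prints a usage message on standard
error and returns a non-zero status.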

The PHYLIP documentation suggests that the sequential running of several
programs could be automated, and indeed Tim Littlejohn (tim at bch.montreal.ca)
has already provided some elegant sample scripts (available by gopher-ing onto
megasun.bch.umontreal.ca) for Unix systems.

I have some much uglier scripts for both Unix and VMS that require hand
editing before running.  However, these allow options to be selected for
the PHYLIP programs (I believe Tim Littlejohn's scripts use the default
values).  My scripts could be improved drastically (see the example below,
which bootstraps a Neighbor-Joining tree based on protein sequences under
Unix), so maybe someone has already done something like this?

My scripts also try to deal with 2 other problems:

(1) The shuffled files produced by SEQBOOT can be very large,
    e.g. a 4k PHYLIP input file shuffled 2000 times will be
    8Mb!  It seems sensible to put this in a scratch area.

(2) Using the scratch area creates the problem, however, that your
    job may interfere with other PHYLIP jobs, because PHYLIP
    programs write to fixed file names like "outfile".  These other
    PHYLIP jobs may (a) belong to other users, or (b) be other PHYLIP jobs
    of your own that are still running.  My example script (clumsily) solves
    this by using a directory on the scratch area for each user.  Each user
    also has several subdirectories (e.g. dir1, dir2, etc).  Dir1 might be
    used by DNA Parsimony PHYLIP jobs, dir2 by DNA Neighbor-Joining PHYLIP
    jobs etc.  So as long as a user runs only one job of each class at a
    time there will be no problem!

    The example script is using /scratch/username/dir3 to "park" the
    intermediate files.  PNJB = "protein Neighbor-Joining Bootstrap".
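
Both points can be sketched in a few lines of Bourne shell.  This is only
one possible alternative to the fixed dir1/dir2/dir3 scheme, not what the
script below does: it assumes the mktemp command is available on your
system, and the function names estimate_kb and make_job_dir are made up
for illustration.  Giving every job its own uniquely named directory
means that two jobs of the same class cannot overwrite each other's
"outfile".

```shell
#!/bin/sh
# Rough size of the SEQBOOT output in kilobytes:
# input file size (in k) times the number of replicates.
estimate_kb() {
    echo $(( $1 * $2 ))
}

# Create a private, uniquely named working directory under a scratch
# root (default /tmp here; /scratch/username in this post) and print
# its path.  Assumes mktemp is installed.
make_job_dir() {
    mktemp -d "${1:-/tmp}/phylip.XXXXXX"
}

estimate_kb 4 2000              # 4k input file, 2000 replicates
job1=$(make_job_dir) || exit 1
job2=$(make_job_dir) || exit 1
echo "$job1"                    # two different directories,
echo "$job2"                    # one per PHYLIP job
rmdir "$job1" "$job2"           # remove the empty demo directories
```

The first line printed is 8000 (k), i.e. the 8Mb quoted in point (1).
A real job would "cd" into its own directory before running SEQBOOT and
remove the directory when finished.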

Apologies for the scruffy script.  Hopefully it is of use.  

Frank Wright
SASS, University of Edinburgh,
J.C.M.B. room 3610, Kings Buildings,
Edinburgh EH9 3JZ, Scotland, U.K.

frank at sass.sari.ac.uk


#==========================================================
# PNJB.batch 1.1 (c) Frank Wright 13th Feb 1993
# (writes "scratch" files to /scratch/username/dir3)
#
# YOU MUST HAVE CREATED "/scratch/username/dir3" beforehand!
#
#  Bootstrapping PROTDIST & NEIGHBOR (Neighbor Joining)
#        (kimura distance; 100 bootstrap trials)
#
# Things to alter:
#
# (0) Change all references to "username" to your own username!
#
# (1) Check the comment lines beginning # <<<<    - I have 
#     tried to remind you what is required.
#
# (2) The number of bootstrap trials (shuffles).  Remember
#     that this number should be the same for the SEQBOOT
#     and PROTDIST and NEIGHBOR  programs! 
# 
#==========================================================
# <<<< move to directory required...
# copy file to scratch area, and "cd" to your scratch dir.. 
#==========================================================
cd
cd myworkdir
cp hbbpep.phy   /scratch/username/dir3/infile1
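# stop here if the scratch dir is missing, so that the "rm" commands
# below cannot run in the wrong directory
cd /scratch/username/dir3 || exit 1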
#==========================================================
# first make shuffled "multiple data sets" with SEQBOOT ...
#==========================================================
# <<<< After next line: replace with own options if required.
seqboot << END1
infile1
123
R
100
2
Y
END1
#==========================================================
# now run PROTDIST on shuffled "multiple data sets"...
#==========================================================
\rm infile1
mv  outfile  infile2
# <<<< After next line: replace with own options if required.
protdist << END2
infile2
P
M
100
2
Y
END2
#==========================================================
# Tidy disk space & run NEIGHBOR-NJ on "multiple" dists...
#==========================================================
\rm infile2
mv outfile infile3
neighbor << END3
infile3
M
100
O
1
2
3
Y
END3
#==========================================================
# Tidy disk space & run CONSENSE to get bootstrapped tree..
#==========================================================
\rm infile3
\rm outfile
mv treefile infile4
consense << END4
infile4
O
1
R
2
Y
END4
#==========================================================
# Tidy disk space.
# <<<< Exit to own dir.
# <<<< give meaningful names to output files.  
#==========================================================
\rm infile4
cd
cd myworkdir
mv /scratch/username/dir3/outfile  PNJB.out
mv /scratch/username/dir3/treefile PNJB.tree
#==========================================================
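
The number of bootstrap trials (100) is typed separately into the
SEQBOOT, PROTDIST and NEIGHBOR responses in the script above and must
agree in all three.  One possible tidy-up, sketched here for a
Bourne-type shell, is to keep the count in a shell variable and let the
here-document expand it.  NREPS and seqboot_answers are made-up names,
and "cat" stands in for the PHYLIP program so that the snippet runs
without PHYLIP installed.  The responses themselves are simply copied
from the script above and have not been checked against any other
PHYLIP version.

```shell
#!/bin/sh
NREPS=100     # hypothetical variable: number of bootstrap trials

# Print the responses that would be fed to SEQBOOT.  Because the
# delimiter END1 is not quoted, $NREPS is expanded inside the
# here-document.
seqboot_answers() {
    cat << END1
infile1
123
R
$NREPS
2
Y
END1
}

seqboot_answers
```

In a real script "cat" would be replaced by "seqboot", and the same
variable would be used in the PROTDIST and NEIGHBOR responses.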



