Symposium on Large Phylogenies

Junhyong Kim Junhyong_kim at
Mon Dec 1 15:03:38 EST 1997


=93Estimating Large Scale Phylogenies: Biological, Statistical, and
Algorithmic Problems=94

SPONSORS: DIMACS and University of Pennsilvania Program in Computational
LOCATION: Princeton University
DATE: June 26-28, 1998
FORMAT: Paper presentations and posters. All papers for oral
presentation must be submitted in full and they will be peer reviewed.

PAPER SUBMISSION DEADLINE: April 15, 1998. Please submit papers by mail
or email (ps file/MS Word file only) to:

Junhyong Kim
Dept. of Biology
Yale University
165 Prospect st.
New Haven, CT 06511
(203)-432-3854 (fax)
junhyong_kim at

Co-organizers: Junhyong Kim, Tandy Warnow, and Ken Rice


Biological organization is fundamentally based on an evolutionary
history of bifurcating descent-with-modification. Phylogenetic
estimation is the inference of this genealogical history from present
day data. Phylogenetic trees, the graph representation of the
genealogical history, play a central role in evolutionary biology and
phylogenetic estimation techniques are being applied to a wide variety
of computational biology problems.

The size of a phylogenetic estimation problem is measured by the number
of taxa and the number of characters. Until recently, computational and
data limitations kept most phylogenetic estimation problems to small
numbers of taxa. But, the availability of computational resources and
the influx of large molecular data sets are enabling researchers to
tackle increasingly larger problems, and the analysis of large-scale
data sets is rapidly becoming a central problem in phylogenetic biology.=20

Recent experimental evidence has established the existence of large
trees that can be estimated accurately as well as those that are
difficult to accurately estimate with reasonable numbers of characters.
Some of these examples have suggested that taxon sampling (increasing
the size of the estimation problem through the addition of taxa rather
than characters) might lead to more easily estimated trees. Conversely,
it has been argued that big trees are hard to infer for a variety of
reasons: NP-hardness of the optimization problems, properties of the
search space, inadequacy of the heuristics, and even possible inadequacy
of the optimization criteria. Unfortunately, very little actual evidence
is available to support any conjectures about how the performance of
estimators scale with respect to the size of the phylogenetic problem.
In addition, the question of scaling is itself confused by poorly
delineated notions. For example,  the size of the tree also involves the
maximum amount of divergence (not only the number of taxa and
characters) and measures of estimator performance have also not been
standardly agreed upon.=20

The goal of this symposium is to precisely identify the key problems
with respect to how the performance of phylogenetic estimators scale as
with the size of the problem, and gather experimental and theoretical
results addressing this problem.=20


The symposium will consist of four topic sessions with paper
presentations followed by a panel discussion of invited experts. The
four topics and some of the questions to be addressed in each session

Biological problems

1. What are the limits to sampling characters and taxa?
2. What are examples of very difficult problems?
3. What are the reasonable models of character evolution and tree shape?
4. What are the most important problems in systematics?
5. What can we say about evolutionary history from data other than rows
and columns of homologous characters?

Empirical results

1. What do simulation studies tell us about performance of different
methods and how they scale with the size of the problem?
2. What properties of the tree models affect accuracy and how do those
properties scale?
3. Are there any methodological biases?
4. What can we say about performance under more realistic models of
sequence evolution from the existing studies?
5. Is there a need to standardize experimental studies, perhaps through
the establishment of a testbed of different model trees, methods, etc?

Algorithmic problems

1. What is the relationship between standard optimization problems
(distance-based criteria, parsimony, etc) and estimating the topology of
evolutionary trees? Which of the standard optimization criteria are best
suited to obtaining highly accurate topology estimations, given bounds
on the available sequence length?
2. How much of the difficulty is due to inadequate solution to the right
NP-hard optimization problems?
3.  Are there new optimization problems or approaches (not necessarily
linked to optimization criteria) that are promising?=20
4. How good are the existing heuristics for solving the relevent
optimization problems, and what new approaches might give better results
on important optimization problems?
5. How should we evaluate performance of algorithms?
6. Are there ``algorithms engineering" issues which will make these
methods less powerful, and how do we handle them?
7. Is it possible to design methods which can efficiently characterize
all optimal and near-optimal trees, rather than just a single optima?

Statistical problems

1. What bounds can we obtain on the convergence rate of different
2. How do various statistical properties of different methods scale with
the size of the problem?
3. What is the relationship between estimating the whole tree versus
some subset of the tree?
4. What is the distribution of specific tree characteristics such as
smallest edge length, smallest diameter for quartet covering, steminess,
etc. with respect to tree model sampling distribution?
5. Can we obtain accuracy bounded estimates (sacrificing resolution)?

More information about the Mol-evol mailing list