Program for tree comparison
Ed Adams
adams at netcom.com
Fri Apr 14 16:07:38 EST 1995
Ed Adams (adams at netcom.com) wrote:
: Warren Frank Lamboy (wfl1 at cornell.edu) wrote:
: : In article <3mdtof$8sj at sunserver.lrz-muenchen.de>,
: : strimmer at wap18.zi.biologie.uni-muenchen.de (Korbinian Strimmer) wrote:
: : >
: : > To all tree reconstructors out there!
: : >
: : > Using the programs from the PHYLIP package of Joe Felsenstein I have
: : > produced quite a huge number of trees written down as specified within
: : > the "New Hampshire" standard for computer readable trees (an example
: : > for an unrooted tree may be (a, b, ((c, d), e)); ) Now I want to compare
: : > all these trees (that are all in one big treefile) to one specific tree
: : > that is also given in another file. I want to count how many trees in the
: : > big treefile are identical to the specified tree. As there are many
: : > possibilities for writing down a given tree in the "New Hampshire"
: : > form one can not simply compare the two files with a text editor
: : > but one must think of another way. I suppose that this program must
: : > work in a way Consense (from PHYLIP) works, but Consense alone gives
: : > no answer to my problem.
: : >
: : > I am very sure that many people must have encountered this problem before,
: : > and I am sure that there exists already a solution to this. If you know
: : > how to deal with this problem please give me a hint and contact me!!
: : >
: : > Thank you
: : >
: : > Korbinian Strimmer
: : >
: : >
: : > ----------------------------------
: : > strimmer at zi.biologie.uni-muenchen.de
[Lamboy's post snipped]
My previous post, giving a long-winded algorithm, was sloppy. Here
is a more refined, simpler, shorter-winded approach. It sounds like
what I think Joe Felsenstein was suggesting, although merely flipping
around isn't enough.
COMPARING NEW HAMPSHIRE UNROOTED TREES
with UNIQUE leaves
Here's a strategy, followed by an algorithm:
Strategy: Converted the tree into a canonically ordered list of its clusters,
then compare such lists for equality.
A cluster in an unrooted tree corresponds to an arc in its graph
representation. The arc separates the tree into two sets, call them S and T,
so the cluster is represented by that pair of sets. For simplicity, we can
represent the cluster by whichever of (S, T) contains a particular leaf,
such as "a", as long as we represent all clusters in all trees with respect
to the same leaf.
In the New Hampshire representation, a cluster appears as a
parenthesized group. If we take the set of leaves in that group to be
set S, and its complement to be set T, we have a cluster as defined above.
Once again, we can represent the cluster by whichever of (S,T) contains "a".
There are some trivial clusters: The set of all the leaves, its
complement (designated by "0"), any set consisting of just one leaf,
any set consisting of all but one leaf. These clusters are common to all
trees on these leaves, so we aren't interested in them.
Example: (a, b, (( c, d), (e, f)))
The clusters are:
( {abcdef}, {cdef}, {cd}, {ef})
Replacing each set that doesn't contain "a" with its complement,
( {abcdef}, {ab}, {abef}, {abcd} )
Removing the trivial cluster
( {ab}, {abef}, {abcd} ),
Then, since the clusters are unordered, sort them into lexicographical
order:
( {ab}, {abcd}, {abef} ).
Do the same to another tree on the same nodes. It will have this representation
if and only if it has exactly the same clusters.
ALGORITHM (informally stated):
To identify a cluster: Every parenthesized expression contains some leaf
identifiers. The cluster consists of that set of leaves, or its complement,
depending on which set contains the leaf "a".
A non-trivial cluster is one whose size is greater than 1 and less
than ( (size of universal set) -1 ).
To transform the New Hampshire tree into this useful form:
For all non-trivial clusters:
Sort the leaves in the cluster into alphabetic order
Sort the clusters in lexicographic order.
To compare two trees in useful form:
If the character strings are identical, the trees are identical.
EXAMPLE:
T1 = (a, b, ((c, d), (e, f))) versus T2 = (b, a, (( e, f), (c, d)))
Clusters of T1: Clusters of T2:
({abcdef},{cdef},{cd},{ef}) ({baefcd},{efcd},{ef},{cd})
Replace with complement if necessary (using "a" as kernel)
({abcdef},{ab},{abef},{abcd}) ( 0, {ab}, {abcd}, {abef} )
Remove trivial clusters
({ab}, {abef}, {abcd}) ({ab}, {abcd}, {abef})
Sort Leaves within sets:
({ab}, {abef}, {abcd}) ({ab}, {abcd}, {abef})
Sort clusters within tree:
({ab}, {abcd}, {abef}) ({ab}, {abcd}, {abef})
Compare trees for equality:
Success.
ANOTHER EXAMPLE:
T3 = ((a,b), c, ((d,e),(f,g))) T4 = (f, g, ((d,e),(c,(a,b))))
Clusters:
({abcdefg},{ab},{defg},{de},{fg}) ({fgdecab},{decab},{de},{cab},{ab})
Replace with complement if necessary (using "a" as kernel)
T3 = ({abcdefg},{ab},{abc},{abcfg},{abcde})
T4 = ( 0, {decab}, {abcfg}, {cab}, {ab})
Remove trivial clusters:
({ab},{abc},{abcfg},{abcde}) ({decab}, {abcfg}, {cab}, {ab})
Sort Leaves:
({ab}, {abc}, {abcfg}, {abcde}) ({abcde}, {abcfg}, {abc}, {ab})
Sort clusters:
({ab}, {abc}, {abcfg}, {abcde}) ({ab}, {abc}, {abcde}, {abcfg})
Comparison:
Success.
Have fun again,
Ed
More information about the Mol-evol
mailing list