IUBio Biosequences .. Software .. Molbio soft .. Network News .. FTP

Program for tree comparison

Ed Adams adams at netcom.com
Fri Apr 14 12:01:39 EST 1995

Warren Frank Lamboy (wfl1 at cornell.edu) wrote:
: In article <3mdtof$8sj at sunserver.lrz-muenchen.de>,
: strimmer at wap18.zi.biologie.uni-muenchen.de (Korbinian Strimmer) wrote:

: > 
: > To all tree reconstructors out there! 
: > 
: > Using the programs from the PHYLIP package of Joe Felsenstein I have
: > produced quite a huge number of trees written down as specified within
: > the "New Hampshire" standard for computer readable trees (an example 
: > for an unrooted tree may be  (a, b, ((c, d), e)); ) Now I want to compare
: > all these trees (that are all in one big treefile) to one specific tree
: > that is also given in another file. I want to count how many trees in the
: > big treefile are identical to the specified tree. As there are many
: > possibilities for writing down a given tree in the "New Hampshire"
: > form one can not simply compare the two files with a text editor
: > but one must think of another way. I suppose that this program must 
: > work in a way Consense (from PHYLIP) works, but Consense alone gives
: > no answer to my problem.
: > 
: > I am very sure that many people must have encountered this problem before,
: > and I am sure that there exists already a solution to this. If you know
: > how to deal with this problem please give me a hint and contact me!!
: > 
: > Thank you
: > 
: > Korbinian Strimmer
: > 
: > 
: > ----------------------------------
: > strimmer at zi.biologie.uni-muenchen.de

: If I understand your problem correctly, I think that one way to do this
: would be to compute Robinson and Fould's partition metric between each tree
: in the data set and the given specific tree.  A value of 0 for the
: partition metric would indicate that the two trees are identical.  (The
: partition metric is a count of the number of branch contractions and node
: expansions that it takes to convert one tree into another.  If none are
: needed, then the trees are the same.)  Unfortunately, I know of no program
: that computes this value for you--so I am afraid that this may not be much
: help unless you are a programmer or have access to one.  The reference is:
: Robinson, D.F. and L.R. Foulds. 1981.  Comparison of phylogenetic trees. 
: Mathematical Biosciences 53:  131-147.  Sorry I can't be of more help.

: -- 
: Warren F. Lamboy                            "It's easy if you know how to
:                                              do it, but it's impossible if
:                                              you don't know how to do it."

Here is a long-winded explanation of how to compare the trees directly.
The theorem is not proved here, but here's the method and examples. 
Questions and comments are welcome.


Here's a strategy, followed by an algorithm:

Strategy: Converted the tree into a canonically ordered list of its clusters,
	then compare such lists for equality.

	A cluster in an unrooted tree corresponds to an arc in its graph 
representation. The cluster can be represented as an unordered pair of
sets (S,T). If we were to break the tree at this arc, we would separate it
into two sets of leaves; call them S and T ( in either order ).

Example:   (a, b, (( c, d), (e, f)))

The clusters are:

( {a}, {bcdef} ),
( {b}, {acdef} ),
( {c}, {abdef} ),
( {d}, {abcef} ),
( {e}, {abcdf} ),
( {f}, {abcde} ),
( {cd}, {abef} ),
( {ef}, {abcd} ),
( {cdef}, {ab} ).

Obviously, the leaf clusters are trivial and guaranteed to be in every
tree on these nodes, so let's ignore them and list only the non-trivial

( {cd}, {abef} ),
( {ef}, {abcd} ),
( {cdef}, {ab} ).

Now, since the items in the pair are unordered, sort them into lexicographical

( {abef}, {cd} ),
( {abcd}, {ef} ),
( {ab}, {cdef} ).

Then, since the clusters are also unordered, sort THEM into lexicographical

( {ab}, {cdef} ),
( {abcd}, {ef} ),
( {abef}, {cd} ).

Do the same to another tree on the same nodes. It will have this representation
if and only if it has exactly the same clusters.

ALGORITHM (informally stated):

To identify a cluster: Every parenthesis-balanced string that is either
	(a) between left parenthesis and comma,
	(b) between commas, or
	(c) between comma and right parenthesis
contains some leaf identifiers. The cluster consists of that set of leaves,
call it (S), and all the rest of the leaves, call that set (T). The cluster
is the pair ( S, T ).
	A non-trivial cluster is one whose S and whose T each have more than
one leaf.

To transform the New Hampshire tree this useful form:
	For all non-trivial clusters:
		Sort the leaves of S into alphabetic order,
		Sort the leaves of T into alphabetic order.
	Sort the clusters in lexicographic order.

To compare two trees in useful form:
	If the character strings are identical, the trees are identical.


	T1 = (a, b, ((c, d), (e, f))) versus T2 = (b, a, (( e, f), (c, d)))

	Clusters of T1: 		Clusters of T2:
	({cd}, {abef}),			({ef},{abcd}),
	({ef}, {abcd}),			({cd},{abef}),
	({cdef}, {ab}).			({efcd},{ab}).

	Sort Leaves within sets:
	({cd}, {abef}),			({ef},{abcd}),
	({ef}, {abcd}),			({cd},{abef}),
	({cdef}, {ab}).			({cdef}, {ab}).

	Sort sets within clusters:
	({abef}, {cd}),			({abcd}, {ef}),
	({abcd}, {ef}),			({abef}, {cd}),
	({ab}, (cdef}),			({ab}, {cdef}),

	Sort clusters within tree:
	({ab}, {cdef}),			({ab}, {cdef}),
	({abcd}, {ef}),			({abcd}, {ef}),
	({abef}, {cd}),			({abef}, {cd}).

	Compare trees for equality:


	T3 = ((a,b), c, ((d,e),(f,g)))   T4 = (f, g, ((d,e),(c,(a,b))))

	({ab}, {cdefg}),		({de}, {abcfg}),
	({de}, {abcfg}),		({ab}, {cdefg}),	
	({fg}, {abcde}),		({cab}, {defg}),
	({defg}, {abc}).		({decab}, {fg}).

	Sort leaves:
	({ab}, {cdefg}),		({de}, {abcfg}),
	({de}, {abcfg}),		([ab}, {cdefg}),
	({fg}, {abcde}),		({abc}, {defg}),
	({defg}, {abc}).		({abcde}, {fg}),

	Sort sets:
	({ab}, {cdefg}),		({abcfg}, {de}),
	({abcfg}, {de}),		({ab}, {cdefg}),
	({abcde}, {fg}),		({abc}, {defg}),
	({abc}, {defg}),		({abcde}, {fg}).

	Sort clusters:
	({ab}, {cdefg}),		({ab}, {cdefg}),
	({abc}, {defg}),		({abc}, {defg}),
	({abcde}, {fg}),		({abcde}, {fg}),
	({abcfg}, {de}).		({abcfg}, {de}).



	Notice, once you have sorted the leaves within each set,
and sorted the two sets with a cluster, you can drop the second set of
each cluster, as being predictable.

Have fun,


More information about the Mol-evol mailing list

Send comments to us at biosci-help [At] net.bio.net