testing if seqs. are in same phylo tre

Arlin Stoltzfus arlin at cs.dal.ca
Mon Nov 20 20:51:43 EST 1995

DEernisse at fullerton.edu (Doug Eernisse) wrote:

>Having said that, we won't recognize the homology between a
>"G" and "C" unless we have some other context to suggest that these
>differing states may correspond due to ancestry. For example,
>flanking sites may have an identical or highly similar distribution 
>of states. It may be possible to estimate the likelihood that the
>observed similarity or pattern of similarity in the sequence overall
>is not due to chance. 

There is a method for determining this, based on the alignment scores
of randomized sequences.  The basic idea is simple: shuffle the candidate
sequences many times, align each shuffled pair, and use the resulting
alignment scores as a null distribution for asking whether the alignment
score of the native sequences is significantly
better than expected by chance.  This automatically accounts for
frequencies of character states (nucleotides or amino acids), and
subsequences of length N can be shuffled if one also wants to 
take into account N-mer frequencies (e.g., dinucleotides or 
trinucleotides).  This is all explained in Russ Doolittle's excellent 
and highly readable handbook, _Of URFs and ORFs_.
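
To make the idea concrete, here is a minimal sketch in Python.  It is my
own illustration, not code from Doolittle's book.  The scoring values
(match=1, mismatch=-1, gap=-2), the simple linear-gap global alignment,
and the names nw_score, shuffled and shuffle_test are all assumptions
chosen for brevity.  Shuffling in blocks of length k only roughly
preserves k-mer frequencies (one reading frame, within blocks).

```python
import random

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch, linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def shuffled(seq, rng, k=1):
    """Shuffle seq in blocks of length k (k=1 shuffles single residues)."""
    blocks = [seq[i:i + k] for i in range(0, len(seq), k)]
    rng.shuffle(blocks)
    return "".join(blocks)

def shuffle_test(a, b, n=200, k=1, seed=0):
    """Compare the native score with scores of n shuffled pairs.

    Returns (native score, z-score in SD units of the null distribution,
    empirical p-value with the usual +1 correction).
    """
    rng = random.Random(seed)
    native = nw_score(a, b)
    null = [nw_score(shuffled(a, rng, k), shuffled(b, rng, k))
            for _ in range(n)]
    mean = sum(null) / n
    sd = (sum((s - mean) ** 2 for s in null) / (n - 1)) ** 0.5
    z = (native - mean) / sd if sd > 0 else float("nan")
    p = (1 + sum(s >= native for s in null)) / (n + 1)
    return native, z, p
```

Calling shuffle_test(seq1, seq2, n=1000, k=2) would shuffle dinucleotide
blocks; a real analysis would use a proper substitution matrix and gap
model rather than these toy scores.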

>There have also been attempts by chemists/physicists to claim genes 
>are homologous despite low (< 20%) identity based on how the sequences 
>conform to a model of higher-level structure (someone from EMBL research 
>group, sorry can't remember who). These claims of homology were disputed
>by others who thought that it was equally or more likely that there were 
>only so many ways to fold or twist a string of amino acids (i.e., 
>convergence of structural features). As far as I know, this is still
>an ongoing debate as to whether all proteins can be traced back to
>relatively few ancestral proteins or whether new proteins arise
>frequently and might share similar structural properties with existing

An excellent summary of the basic logic of this question!  It's difficult 
to see how such a question can be resolved with only negative evidence.
A case in point is the set of proteins with the so-called
'TIM barrel' structure seen in TPI and several other enzymes of 
central carbon metabolism: a toroidal (donut-shaped)
domain of alternating alpha helices and 8 beta strands, the latter forming
the barrel in the center.  Because the strands form a pleated sheet, the
toroidal domain has a rotational symmetry of 90 degrees.  The hypothesis
investigated by one group (I wish I had the reference) was that if the 
different proteins were related, they might show more sequence 
similarity when their amino and carboxy ends were aligned, rather 
than in one of the other three rotationally isomeric alignments 
in which the ends did not align.  As it turned out, there was no 
statistically significant excess of amino acid similarity for the 
native alignment.  So, no conclusion could be reached on the basis 
of these negative results: either the shared structure was
a result of convergent evolution, or the proteins were actually 
homologous but had diverged so greatly that no significant similarity 
could be detected. 
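
The comparison of rotational alignments can be sketched as follows.
This is only an illustration of the logic, not the method of the group
in question (whose reference I do not have).  Cutting the sequence into
equal quarters by length is an assumption (a real study would presumably
define the units from the structure), and the names nw_score, rotate and
rotation_scores and the toy scoring values are hypothetical.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score (Needleman-Wunsch, linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def rotate(seq, units, shift):
    """Circularly permute seq by `shift` out of `units` equal segments."""
    cut = (len(seq) * shift) // units
    return seq[cut:] + seq[:cut]

def rotation_scores(a, b, units=4):
    """Score a against each rotation of b; index 0 is the native,
    end-to-end alignment, the others are the rotated alignments."""
    return [nw_score(a, rotate(b, units, s)) for s in range(units)]
```

If the proteins were detectably related, the first score (ends aligned)
should stand out from the other three; each score could be judged
against shuffled sequences as described earlier.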

Meanwhile, some protein chemists are quite ready to draw a firm conclusion.  
In _An Introduction to Protein Structure_, Branden & Tooze
state that different TIM barrel proteins do not have demonstrably
higher-than-random sequence similarity, and then they state unequivocally
that the proteins are therefore not homologous, a conclusion that seems


Arlin Stoltzfus
Department of Biochemistry
Dalhousie University
Halifax, Nova Scotia B3H 4H7 CANADA
arlin at is.dal.ca 902-494-3569 (phone) 902-494-1355 (fax)
