correlated mutations

salamon at notmendel.Berkeley.EDU
Sun Feb 11 15:39:07 EST 1996

Ewan Birney (birney at wrote:
: Aare Abroi wrote:
: > 
: > Hallo !
: > 
: > I have a question about correlated mutations. For example is it possible to
: > find amino-acids which are involved in receptor-ligand interaction by
: > analyzing receptor and ligand protein families  (I mean the same receptor or
: > ligand sequences from different organisms). The complementary surfaces may
: > be mirrored in correlated mutations in sequences. If there are some programs
: > which do this work could you please let me know.
: > 
: > --
: >        Aare Abroi                 Estonian Biocentre
: >      aabroi at                    Riia 23
: >   tel. +372 7 420 223              Tartu, EE2400
: >   fax: +372 7 420 286                 Estonia

: This question is actually harder than it looks as it relies 
: implicitly on understanding the evolution of the proteins 
: involved  - i.e. you have to have a good tree at first. 

: some time ago I tried to write a program that would look for
: correlation inside a single alignment. This would be identical
: to the receptor ligand problem as one could concatenate the alignments
: of the two proteins as one...

: but the program didn't work at all well (well -- I wrote it a long
: time ago). If anyone has a good algorithm for tackling this problem
: I'd love to hear about it

: ewan

: birney at

This is an interesting topic.  
I think that it would be useful to define what kind of data one could
use to discover correlated _substitutions_ in molecules encoded
by different loci.  It seems implicit in the statements above that data
of the following kind are to be used:

Sequence data (at least) exist for a number of loci, each sampled
from several taxa.

Ignoring the question of structural information allowing the modeling
of complementary surfaces/docking (not because I think it isn't
a good approach, but because I am too ignorant on the subject to
address it here), but possibly using some structural/functional
information to narrow down the sets of sites which potentially could
interact, I suggest two general approaches.  First, I have thought
about a statistical/phylogenetic approach that asks whether changes
are too concordant to be due to coincidence.  Again, I do not know
enough to judge whether this has been done, but I suspect it
would require a larger amount of data, and a greater number of
"co-substitutions", to show statistical significance than real data
are likely to present in many cases worthy of investigation.

Now let's assume that we want to search through T taxa, for
which sequences are available at S loci.  I assume that a
reasonable alignment exists for the sequences at each (we hope
homologous) locus.  A relative deletion will be represented
by a "deletion variant" at each site involved.

A second approach could exploit a method I've recently collaborated
in developing (see an abstract at,
but I am not ready to share the code just yet -- it was submitted in September
and is under review).  The idea is simple, though.  What is different
about a sequence as compared to a set of other, aligned sequences?
The answer can be as simple as a single variant at a single site
(see site 3 below) or a pattern (combination) of variants
that, although found individually in the reference set of
sequences, are not found in that combination except in the
sequence being compared to the reference set (sites 1,2, and 4).

 sequence being compared:   AACBAA
 reference seqs:            ABABBA

Such minimal sets are found using the unique combinations method.  
Although there are 2^x-1 combinations of x polymorphic sites
to consider, the method does not need to investigate nearly so
many combinations to find all minimal sets of sites which distinguish
a sequence from a set of other sequences.  In fact the program
has exited successfully in a tolerable amount of time for
data with more than 50 non-redundant polymorphisms.  (The
method has been extended to compare a group of seqs to
a second group, and even to deal with multilocus diploid
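
To make the idea of a minimal distinguishing set concrete, here is a small
brute-force sketch in Python.  It is not the unique combinations method
itself (that code is not described here), only a naive enumeration under
the assumption that a set of sites "distinguishes" a sequence when no
reference sequence matches it at every site in the set.  The function name,
the max_size parameter and the toy sequences are all hypothetical.

```python
from itertools import combinations


def minimal_unique_combinations(target, references, max_size=None):
    """Return all minimal site sets (0-based indices) whose combination of
    variants in `target` is found in none of the `references`.

    Naive enumeration by increasing set size; hypothetical illustration,
    not the published method.
    """
    n = len(target)
    if any(len(r) != n for r in references):
        raise ValueError("all sequences must be aligned to the same length")

    # Sites where every reference agrees with the target cannot help.
    informative = [i for i in range(n)
                   if any(r[i] != target[i] for r in references)]
    limit = len(informative) if max_size is None else min(max_size, len(informative))

    found = []
    for size in range(1, limit + 1):
        for sites in combinations(informative, size):
            site_set = set(sites)
            # Skip supersets of sets already found: they are not minimal.
            if any(prev <= site_set for prev in found):
                continue
            # The set distinguishes if no reference matches at all its sites.
            matched = any(all(r[i] == target[i] for i in sites)
                          for r in references)
            if not matched:
                found.append(frozenset(sites))
    return [tuple(sorted(s)) for s in found]


# Toy data (invented for this sketch): site 3 alone is unique, and
# sites 1, 2 and 4 are unique only in combination.
target = "AACBAA"
references = ["AAAAAA", "ABABAA", "BAABAA"]
result = minimal_unique_combinations(target, references)
print(result)  # 0-based indices: [(2,), (0, 1, 3)]
```

This enumeration is exponential in the worst case, so it is only useful for
seeing what the output looks like on small inputs.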

Now consider:

Concatenate, in a specified order, the sequences for the S loci
for each taxon i, i = 1,...,T.

Now compare concatenated sequence 1 to sequences 2,3,...,T.
And compare 2 to 1,3,4,5,...,T
and 3 to 1,2,4,5,...,T etc.
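
The bookkeeping for this step might look like the following Python sketch.
The nested-dictionary layout (locus -> taxon -> aligned sequence), the
function names and the toy data are assumptions made for illustration.

```python
def concatenate_loci(alignments, locus_order):
    """Join each taxon's aligned sequences in the given locus order.

    `alignments` maps locus name -> {taxon name -> aligned sequence}.
    Only taxa present at every locus are kept.
    """
    shared = set.intersection(*(set(alignments[l]) for l in locus_order))
    return {t: "".join(alignments[l][t] for l in locus_order)
            for t in sorted(shared)}


def leave_one_out(concatenated):
    """Yield (taxon, its sequence, all other sequences) for each taxon."""
    for taxon, seq in concatenated.items():
        others = [s for t, s in concatenated.items() if t != taxon]
        yield taxon, seq, others


# Toy data, invented for this sketch.
alignments = {
    "receptor": {"t1": "AB", "t2": "AA", "t3": "BA"},
    "ligand":   {"t1": "C",  "t2": "D",  "t3": "C"},
}
concatenated = concatenate_loci(alignments, ["receptor", "ligand"])
comparisons = list(leave_one_out(concatenated))
for taxon, seq, others in comparisons:
    print(taxon, seq, "vs", others)
```

Keeping the locus order fixed matters, because a site index in the
concatenated sequence then maps back to the same locus for every taxon.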

Does a particular pair (or combination, in general) of sites
continually appear in the list of unique combinations?  If so,
these could be used to 1) identify the tentatively interacting
loci, and 2) suggest examination of any molecular structural
models that exist for the participating products of the loci.
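
One simple way to ask whether a pair of sites "continually appears" is to
count, for each pair, the number of taxa whose list of unique combinations
contains both sites in the same combination.  The Python sketch below
assumes the lists have already been computed by some means; the function
name and the example input are hypothetical.

```python
from collections import Counter
from itertools import combinations


def tally_site_pairs(combos_per_taxon):
    """Count, for every pair of sites, how many taxa have at least one
    unique combination containing both sites.

    `combos_per_taxon` maps taxon name -> list of site-index tuples.
    """
    counts = Counter()
    for combos in combos_per_taxon.values():
        pairs = set()  # count each pair at most once per taxon
        for combo in combos:
            pairs.update(combinations(sorted(combo), 2))
        counts.update(pairs)
    return counts


# Invented example: site pair (0, 3) recurs in all three taxa.
combos_per_taxon = {
    "t1": [(2,), (0, 1, 3)],
    "t2": [(0, 3)],
    "t3": [(1, 3), (0, 3, 5)],
}
counts = tally_site_pairs(combos_per_taxon)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Pairs whose two sites fall in different loci would be the ones of interest
here; whether a given count is more than coincidence is a separate
statistical question that this tally does not answer.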

This is off the top of my head, so excuse the poor presentation (and faulty
logic?), please.  I would be happy to work with someone with a good
biological problem, and do the combinatorial analysis.  Alternatively, be
patient and the method will be available at the web site.  There is even a
possibility that this and related methods will be made into
a friendly software package and form-based web application if
the NSF feels generous about a certain proposal.

  Hugh Salamon
  salamon at
