Protein composition convergence

Brian Foley btf at lanl.gov
Wed Sep 22 15:28:17 EST 1999

James McInerney wrote:
> Dear all,
> Traditional dogma suggests that we should use protein sequences for
> inferring relationships from molecular sequences in those instances when
> the underlying DNA sequences might be suffering from convergence due to
> mutational bias.

	Most phylogenetic reconstruction programs have an underlying
assumption that the sequences under scrutiny are not under selective
pressure.  In practice, almost everyone uses sequences that are
in fact under selective pressures.
	I work with primate immunodeficiency virus gene sequences
and sometimes other lentiviral sequences as well.  This is a very
rich data set with over 35,000 sequences from a very small genome
(9.7 kb).  We have a whole range of genetic distances available
for analyses, from different virions within the same patient collected
at different times or even from the same blood sample (typically
these sequences range from 100% identity to 95% identity), to 
sequences from different individuals (range from 96% identity to
88% identity) infected with the same subtype, to sequences from
different subtypes (range from 92% identity to 76% identity) to
sequences from different groups (HIV-1 M group vs HIV-1 O group;
less than 75% identical and silent sites are fully saturated with
mutations, so simple % identity measurements begin to be far from
true measures of phylogenetic distance) to sequences from different
primate hosts (several subspecies of African green monkeys, sooty
mangabeys, Sykes monkeys, Mandrills etc...).
	It is very clear that sequence divergence, by any measure
from simple percent identity to the best phylogenetic estimate,
does not increase linearly with time.  The virus evolves at roughly
0.5% per year, and it is very clear that after 200 years we cannot
see 100% divergence because any two random sequences have some
identical sites by chance alone.  The best phylogenetic reconstructions
we are currently able to do seem to be pretty good at estimating
distances and times of divergence out to about 80 years or so,
with the error bar growing larger with each year.  We are thus,
at this point, totally unable to estimate the date of the divergence
between the African green monkey SIVs and the sooty mangabey SIVs, for
example.  It could be as short as 120 years ago, or longer than
40 million years ago!
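
A rough illustration of the saturation point above, not the method used
for the HIV analyses: the sketch below assumes the simple Jukes-Cantor
(JC69) model with equal rates at all sites, and reads "0.5% per year" as
0.005 substitutions per site per year between two sequences.  Both are
assumptions made for illustration, and the function names are hypothetical.

```python
import math

def expected_identity(years, rate_per_year=0.005):
    """Expected fraction of identical sites between two sequences (JC69)."""
    # True number of substitutions per site (assumed pairwise rate).
    d = rate_per_year * years
    # Observed fraction of differing sites; multiple hits hide some changes.
    p = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    return 1.0 - p

def jc69_distance(identity):
    """Estimate the true distance from observed identity (JC69 correction)."""
    p = 1.0 - identity
    if p >= 0.75:
        # At or beyond 75% difference the sequences look random.
        raise ValueError("sequences are saturated; distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

if __name__ == "__main__":
    for years in (10, 40, 80, 200, 1000):
        print(years, round(expected_identity(years), 3))
```

Under these assumptions the expected identity after 200 years is still
about 45%, and it levels off near 25% however long the sequences diverge,
so the corrected distance becomes very sensitive to small errors in the
observed identity.  Real data with site-specific rates saturate unevenly,
which this sketch ignores.
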
> The suggestion being that protein sequences suffer very little from
> compositional convergence.  I am wondering how true this is. 

	It is highly variable, from protein to protein, and even
within different domains of the same protein.  Some sites are under
positive selection to change rapidly (the HIV env is forced by the 
host immune system to change glycosylation sites often; hemoglobin might
need to change to bind O2 more tightly for some creatures such as whales
than others; MHC and Ig genes need to evolve rapidly; etc.) and other
sites are highly constrained or totally invariant.

> If we
> think about the classification of amino acids (aromatic, small polar
> etc.) then there are only a limited number of _allowable_ substitutions
> at any one site (I am of course using this term _allowable_ in a loose
> way).  In other words, the substitution space for a particular amino
> acid is much smaller than 19 (20 including indels) other character states.

	Yes.  Cys-Cys disulfide bond pairs, for example, are totally
invariant in HIV-1 env.
> So, what about convergence in protein-coding sequences?  Is it rampant?
> Is it as extensive as (for instance) thermophilic convergence in
> ribosomal RNA sequences?

	It depends on the protein.  Some are free to evolve at
most sites except the active catalytic core.  Others such as
ribosomal proteins have so many interactions with so many other
proteins, that a great many sites are conserved.
> In reality, if an aromatic amino acid is needed at a particular
> location, then the replacement of phenylalanine by tryptophan or
> tyrosine is much more likely and also the existence of homoplastic
> changes for this site is probably more likely than at the nucleotide
> level when there are four alternatives, rather than (_effectively_) two!

	I am currently trying to work with people who are developing
models of evolution that take into account the selective forces
acting on each site in a protein.  Using iterations of tree building,
model refinement, and re-building the tree, we hope to estimate 
site-specific rate variations in the protein.  This seems to be
the only hope for estimating the date of divergence of organisms
such as the primate immunodeficiency viruses, for which there is
no fossil record to calibrate the clock.
> So, stepping off my soapbox for a second, does anybody agree with this
> comment, or is it completely wrong?  I have inferred amino acid
> compositional trees and often it is possible to generate very different
> trees on the basis of composition and on the basis of, say parsimony or
> likelihood analysis of the characters.  So there are homoplastic amino
> acid compositional changes, it does exist.  But, does it affect
> phylogeny reconstruction?

	It can.  The more distant the sequences, the more it does.
If you only want branching order, and don't care about the distances
between the branch points, it doesn't always matter.
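
To make the idea of a "compositional" comparison concrete, here is a
minimal sketch that measures how far apart two protein sequences are in
amino acid composition alone.  The Euclidean distance on frequency vectors
is an assumption chosen for simplicity; it is not necessarily the measure
used to build the compositional trees mentioned in the question, and the
function names are hypothetical.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Return the 20 amino acid frequencies of a protein sequence."""
    # Gaps and ambiguity codes are ignored.
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]

def composition_distance(seq_a, seq_b):
    """Euclidean distance between two amino acid composition vectors."""
    freq_a, freq_b = composition(seq_a), composition(seq_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(freq_a, freq_b)))
```

Two sequences with the same residues in a completely different order have
a distance of zero here, which is why a tree built from composition can
disagree with one built from aligned characters.
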
> Do we have any good studies of amino acid compositional convergence?
> Protein similarity that is not due to recentness of common ancestry, but
> rather due to compositional convergence (or parallelism, or reversal or
> any homoplastic event you like to name)?

	If you look at a highly conserved protein such as Elongation
Factor 2 (the bacterial homolog is called EF-G) you can see that even
when comparing mouse or rat or hamster to human sequences, there are
something like 100 mutations, of which 98 or 99 are in silent sites.
In other words it has a huge synonymous:non-synonymous ratio.
	We have built a tool for analysing the syn:nonsyn rates
at each site in an alignment, which is very useful for comparing
the selective pressure on different genes, or different regions of
the same gene.
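
As a simplified illustration of the synonymous versus non-synonymous
distinction (this is not the tool mentioned above), the sketch below
compares two aligned, in-frame coding sequences codon by codon using the
standard genetic code.  It counts raw differences only, does not normalise
by the number of synonymous and non-synonymous sites, and skips codons that
differ at more than one position or contain gaps.  The function name is
hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def count_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous, skipped) codon differences."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn = nonsyn = skipped = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        # Multi-hit codons and codons with gaps/ambiguities are set aside.
        if n_diff != 1 or codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            skipped += 1
        elif CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn, skipped
```

For example, `count_differences("ATGGCTAAA", "ATGGCCAGA")` sees one silent
change (GCT to GCC, both alanine) and one replacement (AAA lysine to AGA
arginine).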


> Any input is gratefully received.

	The bottom line is that the "molecular clock" is there, but
it ticks at a different rate for each gene/protein and for each site
within a gene/protein.  The ticks also follow a Poisson distribution,
so the clock gives only an estimate of the time, not a true time, unless
an infinite number of sites are used.  The only way to calibrate the
clock is to have good fossils with good dates, sequences of ancient
DNA, or other strong data for at least some of the nodes in the
tree.  A caveat is that it is possible (even expected) that different
lineages evolve at different rates, even within the same site in the 
same gene.  For example we can see that the V3 loop of subtype
D of HIV-1 M group evolves much faster than the same region of subtype
C, while the flanks of the V3 loop evolve at roughly the same rate
in subtypes C and D.
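
The Poisson point can be put in numbers with a small sketch.  It assumes
a single known rate shared by all sites and no multiple hits, which is
exactly the simplification the caveats above warn about, so treat it as a
best case.  The function name and the example values are hypothetical.

```python
import math

def relative_error(n_sites, rate_per_site_per_year, years):
    """Relative standard error of a time estimate from Poisson 'ticks'.

    The number of substitutions is modelled as Poisson with mean
    n_sites * rate * years; the standard deviation of a Poisson count is
    the square root of its mean, so the relative error is 1/sqrt(mean).
    """
    expected_substitutions = n_sites * rate_per_site_per_year * years
    if expected_substitutions <= 0:
        raise ValueError("expected number of substitutions must be positive")
    return 1.0 / math.sqrt(expected_substitutions)

if __name__ == "__main__":
    for n_sites in (100, 1000, 10000):
        print(n_sites, round(relative_error(n_sites, 0.005, 20), 3))
```

Quadrupling the number of sites only halves the error, and the error
reaches zero only in the limit of infinitely many sites.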

	The good news is that for most data sets, even "quick and
easy" methods such as PHYLIP neighbor-joining give essentially the
same tree topology as the most computer-intensive maximum
likelihood methods with site-specific rates.  The choice of data
points, outgroups and so on can have as large an effect as the
choice of method.  The branching order is close to the same for each
method, but the inferred times of divergence do differ quite a
bit, i.e. branch lengths vary between methods.

|Brian T. Foley               btf at t10.lanl.gov                       |
|HIV Database                 (505) 665-1970                         |
|Los Alamos National Lab      http://hiv-web.lanl.gov/index.html     |
|Los Alamos, NM 87544  U.S.A.                                        |

More information about the Mol-evol mailing list