Sequence Divergence Numbers - What Do They Mean

Brian Foley btf at lanl.gov
Thu Feb 11 12:02:30 EST 1999


Paul D. Roughan wrote:
> 
> Does anyone know what exactly sequence divergence estimates mean in
> terms of base pair mismatch?  For instance, does a figure of 12%
> divergence between the same stretch of DNA in two bacterial strains mean
> that 12 percent of the bases in identical positions in the two strands,
> are different?  

	Yes.  But all such similarity measurements are 
dependent on aligning the sequences first.  And one can
change the "optimal" alignment score by varying the 
gap penalties and other parameters (for example, treating
a transition as less of a penalty than a transversion).
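
	As a rough illustration of the raw percent mismatch, here is a
minimal Python sketch.  It assumes the two sequences are already
aligned to the same length and that columns containing a gap ("-")
are simply skipped; other gap treatments are possible and will give
different numbers.  The function names are only illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def percent_divergence(seq_a, seq_b, gap="-"):
    """Percent of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: columns with a gap are not counted
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)
```

For example, percent_divergence("ACGTACGT", "ACGTACGA") gives 12.5,
one mismatch in eight compared positions.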

> By this criterion, two completely unrelated sequences
> should display a theoretical maximum of 75% divergence, if the sequence
> was long enough (with 4 possible bases, mismatches would occur in 3 out
> of 4 cases).

	Right.  But on top of the raw identity score other
scores of sequence divergence can be used.  For example
the greater the dissimilarity between two sequences which
shared a common ancestor hundreds of thousands of years ago,
the greater the likelihood that many of the sites have 
mutated more than one time between the two.  A G-G match
now, may reflect an unchanged site, or one which has changed
from G to C to A and back to G in one sequence and from
C to A and then to G in the other sequence during the
history of the genes.  This is where programs used in 
phylogenetic analyses use models and calculations to
predict a "phylogenetic distance" that is greater than
the simple % mismatch.
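
	One of the simplest such corrections is the Jukes-Cantor
model, which assumes all four bases are equally frequent and all
substitutions equally likely.  It is shown here only as an example
of the idea, not as the model any particular program uses.  Given
the observed fraction of mismatched sites p, the estimated number
of substitutions per site is d = -(3/4) ln(1 - (4/3) p), which
grows without bound as p approaches the 75% ceiling.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance from an observed mismatch fraction p.

    p is a fraction (0.12 for 12% divergence), and must be below 0.75,
    where the correction is undefined because the sites are saturated.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

With p = 0.12 the corrected distance is about 0.131, only slightly
above the raw mismatch; with p = 0.60 it is about 1.21, roughly
twice the raw figure.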

	
> 
> Is this the method used to generate similarity estimates? Any assistance
> in this would be welcome.

	Another useful measurement in protein coding regions is
to compare the divergence at silent sites to the divergence at
sites which change amino acids.  This can give a measure of the
selective pressure on a protein.  For example a gene such as
DNA polymerase is very important to survival and tends to have
at least some regions of very strongly conserved protein
sequence, so even after hundreds of millions of years of evolution
the protein sequence (and hence the nonsilent sites) has
remained the same, while the silent sites have been saturated
with mutations and approach that theoretical 75% divergence.
A protein that shields a pathogen from immunological 
detection, such as a viral envelope protein, may actually
be under selective pressure for change, and the nonsynonymous
mutations will be equal to or greater than the synonymous
mutations.
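
	A very crude way to see the silent versus amino-acid-changing
split is to walk two aligned coding sequences codon by codon.  The
Python sketch below is a simplification and not a full
synonymous/nonsynonymous rate estimate: it assumes the standard
genetic code, in-frame sequences with no gaps or ambiguous bases,
and it only counts codon pairs that differ at exactly one
nucleotide.  It returns raw counts and does not normalize by the
number of silent and nonsilent sites, which a proper method must
do.  The function name is only illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of single-base codon changes."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and equal length")
    synonymous = 0
    nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1:
            continue  # simplification: identical or multiply-changed codons skipped
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous
```

For example, comparing "CTTGATAAA" with "CTCGAAAAA" gives (1, 1):
CTT to CTC leaves leucine unchanged, while GAT to GAA changes
aspartate to glutamate.
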
	Using this measurement can help give a better
estimate of phylogenetic distance, or time of evolution 
since the genes shared a common ancestor.  Two DNA polymerase
genes with 80% DNA sequence identity are likely to have
diverged longer ago than two viral envelope genes with
80% sequence identity, because the strongly conserved
polymerase genes accumulate changes more slowly.

> 
> --
> Paul D. Roughan

-- 
 ____________________________________________________________________
|Brian T. Foley               btf at t10.lanl.gov                       |
|HIV Database                 (505) 665-1970                         |
|Los Alamos National Lab      http://hiv-web.lanl.gov/index.html     |
|Los Alamos, NM 87544  U.S.A. http://www.t10.lanl.gov/~btf/home.html |
|____________________________________________________________________|



