Maximum Likelihood Analysis Question

Andrew J. Roger aroger at is.dal.ca
Sat Aug 31 10:10:50 EST 1996


Ron Kagan wrote:
> It seems that I have stepped in the middle of quite a bit of controversy
> over the molecular clock.
> 
> In the Jan. 26, 1996 issue of _Science_ 271:470-477, R.F. Doolittle et
> al. published a research article titled _Determining Divergence Times of
> the Major Kingdoms of Living Organisms with a Protein Clock_.  In this
> paper, Doolittle and coworkers used sequence data for 57 different groups
> of proteins and reasonably well-fixed divergence times to calibrate a
> molecular clock and establish a 2,000 Myr procaryote-eucaryote divergence
> date.
> 
> Would anybody here care to comment on the validity/invalidity of
> Doolittle's results based on the molecular clock?

It is well known that distance methods which ignore rate variation
among sites in the alignment often underestimate extreme distances.
While Doolittle did try to account for the proportion of invariable
sites in his analysis, he made no attempt to deal with variable sites
that evolve at different rates (I don't think he corrected for this;
please let me know if I am misreading the paper).

I figure that if he went back to the 57 proteins, attempted to remove
invariable sites, estimated the gamma distribution shape parameter
for each dataset individually, and then applied the gamma distance
correction to each dataset, he might have ended up with a different
answer. It seems clear to me that the value he did get is an
underestimate because of the rate variation problem. Does anyone else
agree?
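
To make the size of the effect concrete, here is a minimal sketch
comparing two textbook protein distance corrections. It is not
Doolittle's actual procedure, and the shape parameter values below are
illustrative assumptions rather than estimates from his 57 datasets.
For an observed proportion p of differing sites, the Poisson correction
(equal rates at all sites) is d = -ln(1 - p), and the gamma correction
with shape parameter a is d = a[(1 - p)^(-1/a) - 1].

```python
import math


def poisson_distance(p):
    """Poisson-corrected distance; assumes all sites evolve at one rate."""
    return -math.log(1.0 - p)


def gamma_distance(p, alpha):
    """Gamma-corrected distance; site rates follow a gamma distribution
    with shape parameter alpha (small alpha = strong rate variation)."""
    return alpha * ((1.0 - p) ** (-1.0 / alpha) - 1.0)


if __name__ == "__main__":
    # Hypothetical observed differences and shape parameters, chosen
    # only to show the trend; they are not taken from any real dataset.
    for p in (0.2, 0.5, 0.8):
        d_equal = poisson_distance(p)
        d_gamma = {a: gamma_distance(p, a) for a in (0.5, 1.0, 2.0)}
        print(f"p={p:.1f}  Poisson={d_equal:.3f}  " +
              "  ".join(f"gamma(a={a})={d:.3f}" for a, d in d_gamma.items()))
```

For any finite shape parameter the gamma distance is larger than the
Poisson distance, and the gap widens as p grows, which is why ignoring
rate variation hurts most at extreme distances.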

Part of the trouble with estimating distances like this is that we
are not using all of the information at our disposal. Typically people
use a pairwise distance measure and then correct it for multiple
hits. However, in character-based analyses like parsimony and
likelihood, estimates of distances between pairs of taxa (the total sum
of branch lengths between these taxa) are probably going to be much more
accurate than simple pairwise distances with a correction applied. This
is because the estimates of the internal branch lengths connecting
two taxa use information from all of the intervening sequences.
On the face of it, this extra information should make the detection of
multiple hits much more efficient, even with an incorrect substitution
model. To support this intuitive argument, I have evidence
that for extreme distances in some proteins (like EF-1alpha) ML and
MP agree on the distance whereas distance methods (like Felsenstein's
PROTDIST using a Dayhoff substitution model) UNDERESTIMATE the
distances. In addition, Bull has shown that in real phylogenies,
MP estimates the numbers of substitutions better than distance
methods (and these are not even extreme distances between taxa).
Perhaps we could use more accurate distance corrections (like
the gamma correction I mentioned above), but why do this when we
can use ML and MP with simple models?
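
For anyone unsure what "the total sum of branch lengths between these
taxa" means in practice, here is a minimal sketch of reading a
tree-based distance off a tree whose branch lengths have already been
estimated (by ML or MP). The tree, the taxon names and the branch
lengths are all hypothetical, and the dictionary layout is just one
convenient assumption, not the format of any particular program.

```python
def distances_to_ancestors(tree, node):
    """Map each ancestor of `node` (and node itself) to the summed
    branch length on the path from `node` up to that ancestor."""
    dist = {node: 0.0}
    total = 0.0
    while tree[node] is not None:
        parent, length = tree[node]
        total += length
        node = parent
        dist[node] = total
    return dist


def tree_distance(tree, a, b):
    """Sum of branch lengths on the path joining taxa a and b."""
    up_a = distances_to_ancestors(tree, a)
    up_b = distances_to_ancestors(tree, b)
    # With non-negative branch lengths, the most recent common ancestor
    # gives the smallest combined distance among shared ancestors.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


# Hypothetical rooted tree: each node maps to (parent, branch length);
# the root maps to None. The lengths are made up for illustration.
example_tree = {
    "root": None,
    "X": ("root", 0.3),
    "A": ("X", 0.2),
    "B": ("X", 0.4),
    "C": ("root", 0.9),
}

if __name__ == "__main__":
    for pair in (("A", "B"), ("A", "C"), ("B", "C")):
        print(pair, round(tree_distance(example_tree, *pair), 3))
```

The point is that the A-to-C distance here passes through two internal
branches whose lengths were estimated from every sequence in the
alignment, not just from A and C.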

Cheers
Andrew J. Roger
aroger at is2.dal.ca
Dept. of Biochemistry
Dalhousie University
Halifax N.S.
TEL: 902 494 3569
FAX: 902 494 1355


