Gaps and PAMs

L.A. Moran lamoran at gpu.utcs.utoronto.ca
Mon Jun 29 21:45:38 EST 1992


In an earlier article I pointed out that "significance" is a subjective term,
and Gaston Gonnet responded,

     "yes, I agree, but with "subjective terms" we cannot do science.  
      The least controversial definition of "significance" is one which 
      relates the probability of an homology against the (null hypothesis) 
      probability of a random coincidence. As the model of homology gets 
      more precise, or you start including information of other nature 
      (e.g. 3-d structure) then the probabilities may be computed 
      differently. But the definition remains the same."

I think that you are missing the point, Gaston. Simply relating the
probability of homology to the probability of a random coincidence is not
sufficient. In order to claim significance you also have to make a decision
about the cutoff point. This decision is subjective.
Furthermore you make a subjective decision when you decide how to measure
similarity in the first place. I don't necessarily agree that your decisions
and assumptions concerning these measurements are valid. This is what science
is all about.
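
To make the point about cutoffs concrete, here is a minimal sketch of a
shuffle test. It is only an illustration, not anyone's published method: the
similarity measure (plain identity count), the two example sequences, the
number of trials and both cutoffs are assumptions chosen for the example.
The probability is computed mechanically, but the verdict still depends on
which measure and which cutoff the analyst picks.

```python
import random

def identity_score(a, b):
    """Count positions where two equal-length sequences carry the same residue.

    This is one possible similarity measure, chosen here for simplicity.
    """
    return sum(x == y for x, y in zip(a, b))

def shuffle_p_value(a, b, trials=2000, seed=0):
    """Estimate how often a shuffled copy of b scores at least as well as b."""
    rng = random.Random(seed)
    observed = identity_score(a, b)
    residues = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(residues)
        if identity_score(a, residues) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)

# Hypothetical sequences, made up for this example.
a = "MKVLAAGIVGLLLAQ"
b = "MKVLSAGIVALLLSQ"
p = shuffle_p_value(a, b)

# The same probability is judged against two different, equally arbitrary cutoffs.
for cutoff in (0.05, 0.001):
    verdict = "significant" if p < cutoff else "not significant"
    print(f"p={p:.4f} cutoff={cutoff}: {verdict}")
```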

By the way, it is not necessarily true that similar 3-D structures indicate
homology.

When I said,

     "Evolutionary distance is actually measured in years or some 
      other unit of time. When comparing two sequences we can estimate 
      the distance by examining the degree of similarity."

Gaston Gonnet replied,

     "beg to disagree. Evolutionary distance, as shown by Dayhoff and 
      many other people, is best measured in PAM units or any units of 
      mutation. The reason is simple, when given just the sequences, we 
      can estimate directly their ED, but we cannot estimate their 
      time-distance without considering at least 3 of the biases which 
      affect the relation between amount of evolution and time. These are:

      (a) species reproduce at very different rates
      (b) crucial proteins mutate much more slowly than less important
          proteins (due to a strong natural selection)
      (c) changes in the environment "force" some rapid mutations.

      So it would be nice to measure time, but we can at best measure
      amount of evolution (amount of change)."

I suspect that you actually agree with my statement. Would you be happy to
rephrase your response to say that "Evolutionary distance ... is best
ESTIMATED in PAM units ..."? Species diverge over time, not over PAM units!
Our calculations may or may not be a valid ESTIMATE of the time of divergence
but we should not lose sight of the fact that they ARE estimates with many
unproven assumptions.
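
The gap between "amount of change" and "time" can be sketched in a few lines.
This is a rough illustration under stated assumptions, not Dayhoff's
procedure: it substitutes the simple Poisson correction for the PAM
calculation, and the observed difference (30%) and both substitution rates
are hypothetical numbers. The first step needs only the sequences; the
second step needs an assumed rate, and the answer scales directly with it.

```python
import math

def poisson_distance(fraction_different):
    """Estimated substitutions per site from the observed fraction of differing sites.

    Uses the Poisson correction d = -ln(1 - p), a stand-in for a PAM calculation.
    """
    if not 0 <= fraction_different < 1:
        raise ValueError("fraction_different must be in [0, 1)")
    return -math.log(1.0 - fraction_different)

def divergence_time(distance, rate_per_site_per_year):
    """Years since divergence, assuming one constant rate along both lineages."""
    return distance / (2.0 * rate_per_site_per_year)

# Hypothetical observation: 30% of aligned positions differ.
d = poisson_distance(0.30)
print(f"estimated change: {100 * d:.1f} substitutions per 100 sites")

# Hypothetical rates: the time estimate is only as good as the assumed rate.
for rate in (1e-9, 5e-10):
    print(f"rate={rate:g}/site/year -> {divergence_time(d, rate):.3g} years")
```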

Allow me to make a comment about your three biases.

     a) It is true that modern species reproduce at different rates
        but whether or not this has much effect on sequence similarity
        is still open to debate.

     b) Yes, this is true. I work with the most highly conserved proteins
        known in biology and they change at a snail-like pace compared
        to others such as the globins and cytochromes.

     c) Changes in environment cannot "force" mutations. What does this 
        mean?

I stated that the best way to detect similarity was to compare aligned
sequences directly and I pointed out that introducing gaps forces one
to select a (subjective) value for these gaps. Similarly a comparison
of non-identical residues requires a subjective decision concerning the
value of such comparisons.
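
As an illustration of how much depends on the gap value, here is a minimal
sketch of a standard global alignment (Needleman-Wunsch) with a linear gap
penalty. The match and mismatch scores, the two gap values and the example
sequences are all assumptions chosen for the example; nothing here is taken
from Gaston's model.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns the two aligned strings and the optimal score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[-1][-1]

# Same sequences, two gap values (both arbitrary choices).
for gap in (-1, -5):
    top, bottom, total = global_alignment("GATTACA", "GTTACAA", gap=gap)
    print(f"gap={gap}: score={total}")
    print(top)
    print(bottom)
```

With the cheap gap the best alignment opens two gaps to line up six
identical residues; with the expensive gap the best alignment has no gaps
at all and only three identities. Neither is "the" alignment until someone
decides what a gap is worth.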

Gaston responded,

     "subjective decisions about the values of gaps is what has been 
      done until recently. We have now given a model under which parameters
      can be computed from the available samples. I am afraid that you
      tend to imply that alignment is "black magic" or "art". I disagree
      strongly with this view. We should establish models, compute the
      parameters for these models, verify/reject the models against reality
      and move into better models when the old ones become unsuitable to
      describe reality. This is the way that science makes progress, not
      with "subjective measures".  There are hundreds of examples of this
      methodology in science."

With all due respect, I do not consider your "model" to be entirely objective.
I still believe that estimating the value of gaps is a difficult problem
that ultimately boils down to a "guesstimate".

And yes, I am implying that alignment is an "art". In fact I will go so far
as to say that I can do a multiple alignment better than any computer program!
I can certainly do a better job than many authors who publish alignments in
Nature or Cell or many other journals. This does not mean that we shouldn't
keep trying to write algorithms that will do the job perfectly; it simply
means that we have a long way to go. I tend to agree with Swofford and
Olson who write,

     "Alignment is probably the most difficult and least understood
      component of a phylogenetic analysis from sequence data....
      we offer the following advice: When regions of the sequence
      are so divergent that a reasonable alignment cannot be obtained
      by manual methods using a sequence editor ("by eye"), those
      regions should probably be eliminated from the analysis."

      D.L. Swofford and G.J. Olson "Phylogeny Reconstruction" in
      MOLECULAR SYSTEMATICS, D.M. Hillis and C. Moritz eds. Sinauer

(That ought to stimulate Swofford to enter this debate! (-:  )

Gaston, your comments about how science works seem to miss the point that
we progress by making hypotheses which often are no better than intelligent
guesses. Often they are wrong, sometimes spectacularly. I also have trouble
with your suggestion that we compare computed measures of similarity against
"reality". What is "reality" in this context? Can you give an example?

I said,

     "I assume that when constructing a Dayhoff matrix only identical 
      amino acids are counted in the initial alignment but that gaps 
      are permitted. Is this correct?"

Gaston replied,

     "no, you are mistaken, please read Dayhoff's original paper, the 
      procedure is much more sophisticated. If you would understand their
      ideas, you would be much more confident in using their tools."

How interesting. When you construct a new "Dayhoff" matrix do you use the
old one to improve the alignments that form the database? If not, then what
"sophisticated" assumptions do you make that justify comparing non-identical
residues in the original alignments? Do you think that these assumptions
might affect the final matrix?

By the way, I have used the Dayhoff matrix in some of my distance 
calculations. I find that it does not change the shape of the tree but it
does alter some of the distances. Since most of the variation that I see
is in regions of the protein that are not constrained, use of the matrix
is not likely to be very helpful. I also find that the original Dayhoff
matrix does not agree with the variation that I see in my alignments. 
Furthermore, my impression is that the presence of sequence mistakes in my 
database is a far more serious source of error than whether or not I use
a particular Dayhoff matrix. Others may find it more useful.

Laurence A. Moran (Larry)
Dept. of Biochemistry



