DNA substitutions saturated?

Doug Eernisse DEernisse at fullerton.edu
Sun Dec 3 12:15:14 EST 1995

In article <DIxqu9.H54 at zoo.toronto.edu>, mes at zoo.toronto.edu (Mark
Siddall) wrote:

> This is in response to Doug Eernisse's post.
> (Hi Doug!).
> You asked about something to do with "How to deal with gaps in the alignment".
> I'd like to follow this thread here by inquiring of the 
> readership how they feel about the 2 propositions:
> 1) indel events are not observed data (that is, one does not 
> observe gaps in a sequence), they are a matter of inference, thus, should
> not be treated as observed data points (i.e., code them as "missing").
> 2) in order to achieve a multiple alignment, one must assign a cost
> to a gap (or string thereof), thus phylogenetic analysis of the 
> aligned data without coding for gaps is inconsistent with the
> epistemology of having gotten the alignment itself. (Can't have your 
> cake and eat it too).
> A caveat regarding #1 is that even though one is coding it as "missing"
> it will be assigned a nucleotide state in searches (just whatever is
> most optimal), and yet, this contravenes the fact that there is no state
> to be had.


  First, sorry that my email intended for Steve Palumbi somehow got posted
to this group, but I am pleased to see an interesting discussion of indels.

  Concerning Mark's paradox: for reasons Joe Felsenstein already 
addressed, one is often faced with the practical situation of
dealing with gaps in a parsimony analysis.  Consider treating them as 
appended present/absent characters. This was first suggested (as far as I 
know) in the PAUP 3.0 manual. I have written software that will append
such a matrix of binary characters to an alignment, treating adjacent
sites with identical gap distribution as if they were a single character.
The issue of weighting, i.e., should a 30 site shared gap be given
more weight than a single site shared gap, remains a difficulty but
weighting can be specified as with other characters if one so desires.
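
  To make the coding scheme concrete, here is a minimal Python sketch
of the idea. It is not the software mentioned above; the function names
(gap_characters, append_gap_matrix), the "-" gap symbol, and the choice
of "1" for gap present are all assumptions made for illustration. Runs
of adjacent sites with an identical gap distribution are merged into one
binary character, and the run length is returned alongside so that it
could be used as a weight if one so desires.

```python
def gap_characters(alignment, gap="-"):
    """Return a list of [pattern, n_sites] gap characters.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    pattern:   string with one '1' (gap) or '0' (no gap) per taxon,
               in the dict's iteration order.
    n_sites:   number of adjacent sites sharing that exact pattern.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    chars = []
    prev = None  # pattern of the immediately preceding site, if it had gaps
    for col in range(length):
        pattern = "".join("1" if s[col] == gap else "0" for s in seqs)
        if "1" not in pattern:
            prev = None  # a gap-free site breaks any run
            continue
        if pattern == prev:
            chars[-1][1] += 1  # same distribution as previous site: merge
        else:
            chars.append([pattern, 1])
            prev = pattern
    return chars


def append_gap_matrix(alignment, gap="-"):
    """Append the binary gap characters to each sequence.

    Returns (new_alignment, run_lengths). The original gap symbols are
    left in place, on the assumption that the tree program will treat
    them as missing data in the nucleotide part of the matrix.
    """
    chars = gap_characters(alignment, gap)
    new_alignment = {
        name: seq + "".join(pattern[i] for pattern, _ in chars)
        for i, (name, seq) in enumerate(alignment.items())
    }
    run_lengths = [n for _, n in chars]
    return new_alignment, run_lengths
```

  Whether the run lengths should actually be used as weights is exactly
the open question noted above; the sketch only reports them and leaves
that decision to the analyst.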

  Veronique Barriel and coauthor(?s) have recently published a paper
in which she advocates considering patterns of gap distribution
by sequence, as well as by site, but I don't think I can represent
(or advocate) her method here.

  One might expect that adding gaps as data should always be desirable
rather than throwing out potentially informative characters. I thought
so, but now am less convinced. Simulations I did suggested that
for a model of evolution that more or less corresponds to "typical"
models assumed by alignment algorithms, adding gaps as data slightly
improves the chances that a parsimony algorithm will get the known
tree, when the known alignment is also used in the analysis! When the
alignment is estimated with various algorithms (e.g., Clustal, the
Malign program by Wheeler and Gladstein) then including the gap matrix
usually decreased the percentage of replicates where the true tree was 
estimated by parsimony. In other words, as the alignment deteriorates, 
one should become increasingly wary of treating gaps as anything other 
than missing data.

Doug Eernisse <DEernisse at fullerton.edu>
Dept. Biological Science MH282
California State University
Fullerton, CA 92634
