DNA substitutions saturated?

Bengt Oxelman bengt.oxelman at systbot.gu.se
Wed Dec 6 09:19:01 EST 1995


Hell, my computer hung just as I'd finished a reply to this. Well, I'll give
it one more try:

In article <DJ5H45.FJD at zoo.toronto.edu>, mes at zoo.toronto.edu (Mark
Siddall) wrote:

> In article <bengt.oxelman-0412951325470001 at mac38.systbot.gu.se>
> bengt.oxelman at systbot.gu.se (Bengt Oxelman) writes:
> 
>              (actually I wrote this following bit)
> >> 1) indel events are not observed data (that is, one does not 
> >> observe gaps in a sequence); they are a matter of inference and thus should
> >> not be treated as observed data points (i.e., code them as "missing").
> >
> >Length differences are as 'observed' as polymorphisms at 'inferred'
> >nucleotide positions.
> 
> I disagree to a point.
> Polymorphisms are not observed either.  No one sequence has more than one
> base at any given position.  Multiple sequences from multiple isolates
> may have different bases that are observed, but no one "observes"
> a polymorphism.

Come on..., you know what I meant, didn't you?

> Regardless, this differs from my point.  
> My point is, change the alignment parameters and you change the 
> homology statement about gaps (in many circumstances).
> 
> Take for example:
> Taxon I AACCGTACT
> TaxonII AACT
> 
> In so far as one could get:
> AACCGTACT
> AAC-----T
> or 
> AACCGTACT
> AA-----CT
> or 
> AACCGTACT
> A-----ACT
> 
> under the same alignment parameters, these are obviously not "observations"
> but inferences.

Observation: Taxon I has nine positions, Taxon II four! Of course, some a
priori inference was made when the two sequences were aligned using their
flanking regions. 
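
To make the quoted example concrete, here is a minimal Python sketch. The function names and the scoring are assumptions for illustration only: a single contiguous gap, one point per identical aligned base, and no gap penalty. It is not any particular alignment program's method. It places the five-position gap at every possible point in Taxon II and keeps the placements that tie for the most matches.

```python
def single_gap_placements(long_seq, short_seq):
    """Score every placement of one contiguous gap in short_seq.

    Assumed scoring (hypothetical): +1 per identical aligned base,
    nothing for mismatches or gap positions.
    """
    gap_len = len(long_seq) - len(short_seq)
    results = []
    for k in range(len(short_seq) + 1):
        aligned = short_seq[:k] + "-" * gap_len + short_seq[k:]
        matches = sum(a == b for a, b in zip(long_seq, aligned))
        results.append((aligned, matches))
    return results


def best_placements(long_seq, short_seq):
    """Return all gap placements that tie for the highest match count."""
    results = single_gap_placements(long_seq, short_seq)
    top = max(matches for _, matches in results)
    return [aligned for aligned, matches in results if matches == top]


for aligned in best_placements("AACCGTACT", "AACT"):
    print(aligned)
# A-----ACT
# AA-----CT
# AAC-----T
```

Under this scoring, all three placements match at four positions, so the score alone cannot choose among them.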

> I do not think this is all that trivial and it makes me wonder about the
> veracity of coding gaps as a "fifth state".

Me too!

> The alternative is to treat them as uninformative, but this really does
> not treat them as "nothing"; it treats them as one of the four observed
> states (ACGT), whichever is most parsimonious, notwithstanding that 
> none of the four observed states was observed or could rationally
> be placed in that position.
> 
> I like Doug's idea of coding gaps separately, like:
> Taxon 1  AACCGTCAGTCAGT-----CGACGTACGTACGTAC 0
> Taxon 2  AACCGTCAGTCAGT-----CGACGTACGTACGTAC 0
> Taxon 3  AACCGTCAGTCAGTGGACTCGACGTACGTACGTAC 1
> Taxon 4  AACCGTCAGTCAGTGGACTCGACGTACGTACGTAC 1
> 
> But it has only limited utility and is still a matter of inference since
> if we add a Taxon 5 and Taxon 6 we could get:
> 
> Taxon 1  AACCGTCAGTCAGT-----CGACGTACGTACGTAC 
> Taxon 2  AACCGTCAGTCAGT-----CGACGTACGTACGTAC 
> Taxon 3  AACCGTCAGTCAGTGGACTCGACGTACGTACGTAC 
> Taxon 4  AACCGTCAGTCAGTGGACTCGACGTACGTACGTAC 
> Taxon 5  AACCGTCAGTCAGT---CTCGACGTACGTACGTAC 
> Taxon 6  AACCGTCAGTCAGT-GACTCGACGTACGTACGTAC 
> 
> and now what?

 Taxon 1  AACCGTCAGTCAGT-----CGACGTACGTACGTAC 0
 Taxon 2  AACCGTCAGTCAGT-----CGACGTACGTACGTAC 0
 Taxon 3  AACCGTCAGTCAGTGGACTCGACGTACGTACGTAC 1
 Taxon 4  AACCGTCAGTCAGTGGACTCGACGTACGTACGTAC 1
 Taxon 5  AACCGTCAGTCAGT---CTCGACGTACGTACGTAC 2
 Taxon 6  AACCGTCAGTCAGT-GACTCGACGTACGTACGTAC 3

That is, the length polymorphism can be coded as an unordered multistate
character.
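
A minimal sketch of that coding in Python follows. The function name, the column range, and the rule "one state per distinct gap pattern, numbered in order of first appearance" are assumptions made for illustration. The gap in Taxon 6 is written as '-' throughout.

```python
def code_gap_region(alignment, start, end):
    """Assign one unordered state per distinct gap pattern in columns
    start..end-1. States are numbered in order of first appearance.
    The numbers are labels only; no ordering among states is implied.
    """
    pattern_to_state = {}
    states = {}
    for name, seq in alignment.items():
        # The pattern records only WHERE the gaps are, not which bases are present.
        pattern = tuple(ch == "-" for ch in seq[start:end])
        if pattern not in pattern_to_state:
            pattern_to_state[pattern] = len(pattern_to_state)
        states[name] = pattern_to_state[pattern]
    return states


alignment = {
    "Taxon 1": "AACCGTCAGTCAGT-----CGACGTACGTACGTAC",
    "Taxon 2": "AACCGTCAGTCAGT-----CGACGTACGTACGTAC",
    "Taxon 3": "AACCGTCAGTCAGTGGACTCGACGTACGTACGTAC",
    "Taxon 4": "AACCGTCAGTCAGTGGACTCGACGTACGTACGTAC",
    "Taxon 5": "AACCGTCAGTCAGT---CTCGACGTACGTACGTAC",
    "Taxon 6": "AACCGTCAGTCAGT-GACTCGACGTACGTACGTAC",
}

# Columns 14..18 (0-based) span the length-variable region.
states = code_gap_region(alignment, 14, 19)
print(states)
# {'Taxon 1': 0, 'Taxon 2': 0, 'Taxon 3': 1, 'Taxon 4': 1, 'Taxon 5': 2, 'Taxon 6': 3}
```

Because the pattern ignores which bases are present, two full-length sequences that differ only by substitutions still receive the same length state.
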
More difficult cases can be imagined, however:

 Taxon 1  AACCGTCAGTCAGT-----CGACGTACGTACGTAC 0
 Taxon 2  AACCGTCAGTCAGT-----CGACGTACGTACGTAC 0
 Taxon 3  AACCGTCAGTCAGTCGACTCGACGTACGTACGTAC 1
 Taxon 4  AACCGTCAGTCAGTGCACTCGACGTACGTACGTAC 1
 Taxon 5  AACCGTCAGTCAGT---CTCGACGTACGTACGTAC 2
 Taxon 6  AACCGTCAGTCAGT-GACTCGACGTACGTACGTAC 3
                         **
The 'G' of Taxon 6 can be placed at either of the two '*' positions, but
the two placements lead to different synapomorphy statements. One solution
could be to code the G in Taxon 6 as ? or 'S' (G or C), or to delete the
positions entirely. Although coding an unambiguous G as an S might appear
absurd, I think this retains an optimal amount of synapomorphic information
in the data. Adding more sequences may complicate matters quickly, however.
I agree with you that it would be nice to have some sort of criterion for
choosing a threshold beyond which these exercises become meaningless, but
in the absence of one, I think each particular case should be examined
individually. Another approach is an iterative alignment/tree-building
method (such as those of Wheeler or Hein), but that would not solve the
problem in this particular case.
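
The ambiguity in the Taxon 6 example can be checked with a short Python sketch. The function names and the scoring are assumptions for illustration: one point per base identical to Taxon 3 or Taxon 4 within the five variable columns. It finds the placements of the single gap that tie for the best score, then reports which taxon shares the G under each.

```python
def tied_placements(residues, references):
    """Place one contiguous gap in `residues` at every position and
    return the placements with the highest total identity to the
    reference sequences (hypothetical scoring: +1 per identical base).
    """
    width = len(next(iter(references.values())))
    gap = "-" * (width - len(residues))
    scored = []
    for k in range(len(residues) + 1):
        candidate = residues[:k] + gap + residues[k:]
        score = sum(a == b
                    for ref in references.values()
                    for a, b in zip(candidate, ref))
        scored.append((candidate, score))
    best = max(score for _, score in scored)
    return [cand for cand, score in scored if score == best]


def taxa_sharing_first_base(candidate, references):
    """Names of reference taxa with the same base as the candidate's
    first non-gap residue, in that residue's column."""
    col = next(i for i, ch in enumerate(candidate) if ch != "-")
    return [name for name, ref in references.items()
            if ref[col] == candidate[col]]


# The five variable columns of Taxon 3 and Taxon 4 in the example above.
references = {"Taxon 3": "CGACT", "Taxon 4": "GCACT"}

for cand in tied_placements("GACT", references):
    print(cand, taxa_sharing_first_base(cand, references))
# -GACT ['Taxon 3']
# G-ACT ['Taxon 4']
```

The two placements score equally, yet one groups the G of Taxon 6 with Taxon 3 and the other with Taxon 4, which is why coding it as 'S' or '?' avoids committing to either statement.
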
There may also be cases where length polymorphisms cannot be unambiguously
coded either:

 Taxon 1  AACCGTCAGTCAGT-----CGACGTACGTACGTAC
 Taxon 2  AACCGTCAGTCAGT-----CGACGTACGTACGTAC
 Taxon 3  AACCGTCAG-----GCACTCGACGTACGTACGTAC
 Taxon 4  AACCGTCAGTCAGTGCACTCGACGTACGTACGTAC
 Taxon 5  AACCGTCAGTCA-----CTCGACGTACGTACGTAC
 Taxon 6  AACCGTCAGTCAGT-CACTCGAC----GTACGTAC

I give up! :(

Bengt

-- 
Bengt Oxelman
Dept. of Systematic Botany
Carl Skottsbergs Gata 22
S-413 19 Goeteborg
SWEDEN
bengt.oxelman at systbot.gu.se 


