Hi,
I have a related but slightly different problem. The program
PROTML (Adachi and Hasegawa's protein maximum likelihood program) codes
missing data as a 21st amino acid. Thus, if you have incomplete
sequences in the same region from two taxa in your alignment,
the overlapping N's or ?'s are counted as matching residues.
Clearly this will positively mislead any program into
showing that the two sequences with missing data are closely
related. However, in the case where only one sequence
has many N's, I do not see how the program will be positively
misled. I can see that the total likelihood of the data
will go down (relative to having the real sequence in the
region) because many relatively improbable changes will be incorporated
into the likelihood calculation. Can anyone see whether the presence
of multiple N's in this situation will produce positively
misleading topologies? My belief is that the
only real effect will be to lengthen the branch leading
to the taxon with the missing data.
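
To make the pairwise part of this intuition concrete, here is a minimal
sketch in Python. It is only an illustration under stated assumptions: it
uses an equal-rates (Poisson) model rather than whatever substitution
matrix PROTML actually applies, it looks at pairwise distances only and
says nothing about topology, and the function names and toy sequences are
made up for the example.

```python
import math

def count_diffs(a, b, missing="?", missing_as_state=True):
    """Return (n_diff, n_sites) for two aligned sequences.

    missing_as_state=True  : '?' is compared like any other character
                             (the "21st amino acid" coding).
    missing_as_state=False : sites where either sequence is '?' are skipped.
    """
    n_diff = n_sites = 0
    for x, y in zip(a, b):
        if not missing_as_state and (x == missing or y == missing):
            continue
        n_sites += 1
        n_diff += x != y
    return n_diff, n_sites

def poisson_distance(n_diff, n_sites, n_states):
    """ML distance under an equal-rates model with n_states states."""
    p = n_diff / n_sites
    limit = (n_states - 1) / n_states  # saturation level of p
    if p >= limit:
        return math.inf
    return -limit * math.log(1.0 - p / limit)

# Toy data (hypothetical): 60 observed sites with 10 real differences,
# followed by a 40-site region where data may be missing.
a_obs = "A" * 60
b_obs = "A" * 50 + "C" * 10

# Both taxa missing in the same region.
a_shared, b_shared = a_obs + "?" * 40, b_obs + "?" * 40
# Only one taxon missing in that region.
a_one, b_one = a_obs + "G" * 40, b_obs + "?" * 40

# Reference: skip missing sites, 20 states.
d_proper = poisson_distance(
    *count_diffs(a_shared, b_shared, missing_as_state=False), 20)
# Missing coded as a 21st state.
d_shared = poisson_distance(*count_diffs(a_shared, b_shared), 21)
d_one_sided = poisson_distance(*count_diffs(a_one, b_one), 21)

print(d_shared, d_proper, d_one_sided)
```

With these toy numbers the shared-gap distance comes out smaller than the
estimate that skips missing sites, and the one-sided distance comes out
larger, which matches the branch-lengthening expectation for a single pair.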
Cheers
Andrew J. Roger