Claudio Slamovits wrote:
> I'm working on a phylogenetic analysis using a DNA sequence
> that show different bases in some positions for the same
> taxon. I use the standard IUPAC code for the ambiguities
> in those positions (Y, R, W, etc) but in most
> cases there is a base more represented than the others (e.g.
> for Y, C appears more than T). I'd like to take advantage
> of this "extra" information in the analysis but I don't
> know how to manage it. I'd appreciate any suggestion.

I believe the NCBI or someone else has created a standard
method of encoding this type of information. Instead of
each base using one 8-bit character (A, C, G, T, R, Y, etc.),
each base is represented by four digits giving the share of
A, C, G and T in tenths (2260 would be 20% A, 20% C,
60% G and 0% T).
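
To make the encoding concrete, here is a minimal Python sketch that
decodes one such four-digit site. The function name decode_site is
made up for illustration, and the digit order (A, C, G, T) and the
tenths scale are inferred from the 2260 example above, not taken
from any published specification.

```python
def decode_site(code):
    """Turn a four-digit site code such as '2260' into base frequencies.

    Assumption: digits are tenths, in the order A, C, G, T.
    """
    if len(code) != 4 or not code.isdigit():
        raise ValueError("expected exactly four digits, got %r" % (code,))
    # Pair each digit with its base and convert tenths to a fraction.
    return {base: int(digit) / 10 for base, digit in zip("ACGT", code)}


print(decode_site("2260"))
```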

However, the common phylogenetic analysis packages
such as PHYLIP do not read this format, so you'd have to
write your own version of DNADIST or whatever program you
wanted to use.
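
As a rough idea of what such a home-made distance could look like,
the sketch below computes an uncorrected distance from per-site base
frequencies. This is only one possible choice and is not how DNADIST
works: the per-site mismatch is taken as the probability that a base
drawn from one taxon differs from a base drawn from the other. The
function names and the tuple layout (A, C, G, T) are assumptions.

```python
def site_mismatch(p, q):
    """Probability that bases drawn from frequency vectors p and q differ."""
    return 1.0 - sum(a * b for a, b in zip(p, q))


def freq_p_distance(seq1, seq2):
    """Mean per-site mismatch probability between two aligned sequences.

    Each sequence is a list of (A, C, G, T) frequency tuples.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    total = sum(site_mismatch(p, q) for p, q in zip(seq1, seq2))
    return total / len(seq1)


taxon1 = [(1.0, 0.0, 0.0, 0.0), (0.2, 0.2, 0.6, 0.0)]
taxon2 = [(1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0)]
print(freq_p_distance(taxon1, taxon2))
```

Note that with this definition two taxa with the same polymorphic
site get a non-zero distance at that site, which may or may not be
what you want.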

Beware that even programs that accept R, Y and other
codes may not all treat them in the same way. If one sequence
has an R and the other has a G, one program may treat this
as a match because G is part of the set R = (A or G). Another
program may give a 1/2 match (assuming that R = 50% A and
50% G). I don't think any of the programs treat it as a
mismatch, but I could be wrong.
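
The two conventions described above can be written out as a short
Python sketch. The function names are hypothetical, and neither
function is meant to reproduce what any particular program does; they
only illustrate the "any overlap is a match" rule and the "equal
frequencies" rule using the standard IUPAC code sets.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def overlap_match(x, y):
    """Score 1.0 if the two codes share at least one base, else 0.0."""
    return 1.0 if set(IUPAC[x]) & set(IUPAC[y]) else 0.0


def expected_match(x, y):
    """Score assuming each code is an equal mix of its possible bases."""
    a, b = set(IUPAC[x]), set(IUPAC[y])
    # Chance that one base drawn from each set turns out the same.
    return len(a & b) / (len(a) * len(b))


print(overlap_match("R", "G"), expected_match("R", "G"))
```

For R against G the first rule gives a full match and the second
gives half a match, which is the difference described above.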