Gordon D. Pusch wrote:
> I have recently found evidence that BLAST and FASTA do not properly handle
> the official IUPAC single-letter code 'U' for selenocysteine, presumably
> because it does not appear in either the PAM or BLOSUM matrices (although
> I have not been able to rule out hard-coding as a cause).
> Are substitution matrices available that include scores for selenocysteine?
> If not, what is the least harmful way of handling the selenocysteine character?
> Should it be changed to the code 'X' for an unknown amino acid? Or should
> it be changed to the code for another amino acid with similar chemical and
> physical properties? Would it be acceptable to change it to the extremely
> rare but still 'legal' character 'Z' for glutamine? Any other suggestions?
I am interested in this issue for EMBOSS.
It appears that a common approach is to treat 'U' as 'C'. This could
mean converting 'U' to 'C' internally, or duplicating the 'C' scores as
'U' for a matrix that does not include 'U'.
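As a rough illustration of those two options, here is a minimal Python sketch. The function names and the dictionary-of-pairs matrix layout are my own assumptions for the example, not how BLAST, FASTA or EMBOSS actually store or handle their matrices, and the scores are illustrative values only.

```python
# Hypothetical helpers for illustration; not taken from any real package.

def mask_selenocysteine(seq):
    """Option 1: convert 'U' to 'C' before alignment, preserving case."""
    return seq.translate(str.maketrans({"U": "C", "u": "c"}))


def add_u_scores(matrix):
    """Option 2: return a copy of the matrix in which 'U' scores like 'C'.

    The matrix is assumed to be a dict keyed by (residue, residue) pairs.
    Scores already present for 'U' are left unchanged.
    """
    extended = dict(matrix)
    for (a, b), score in matrix.items():
        # Wherever 'C' appears in a pair, also emit the pair with 'U'.
        for a2 in ((a, "U") if a == "C" else (a,)):
            for b2 in ((b, "U") if b == "C" else (b,)):
                extended.setdefault((a2, b2), score)
    return extended


# Tiny illustrative matrix covering only cysteine and serine.
toy = {
    ("C", "C"): 9, ("C", "S"): -1,
    ("S", "C"): -1, ("S", "S"): 4,
}
extended = add_u_scores(toy)
masked = mask_selenocysteine("MSUCu")
```

The first option loses the information that the residue was 'U' unless the original sequence is kept alongside; the second leaves the sequence untouched but needs every matrix in use to be extended.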
If there are no acceptable scores for 'U' then 'X' would be an
alternative, although the implementations of some algorithms may have
The use of 'Z' for glutamate/glutamine and 'B' for aspartate/asparagine
goes back to the days of protein sequencing with an amino acid analyser.
Hydrolysing all the amide bonds and then counting the molecular ratios
resulted in asparagine being hydrolysed to aspartate and glutamine to
glutamate so a code was needed to represent the resulting ambiguity.
For protein sequences derived from a DNA sequence these codes are
usually not seen, though I have come across SNPs that translate ('RAC'
for example translates as 'B' because AAC is Asparagine and GAC is