In article <apccod$mpj$1 at mercury.hgmp.mrc.ac.uk>,
Tilman Lamparter <lamparte at zedat.fu-berlin.de> wrote:
>How are gaps to be treated when aligned protein sequences are taken to
>obtain distance matrices? Should the regions be excised in all sequences?
>I use the Phylip protdist program with Jones-Taylor-Thornton model or
>Dayhoff PAM matrix. I always get different results when alignments with
>and without gaps are compared.
I get asked this question a lot.
(1) All modern parsimony, distance, and likelihood programs can cope with
gaps, so you don't have to remove them. But ...
(2) Almost no programs make use of the information provided by the presence or
absence of the gap. They just consider it missing data, as if you
forgot to record the amino acids. The exception is the growing, but not yet
very practical, statistical literature on models that include insertions
and deletions.
(3) However, even if you are not worried about that loss of information, in
practice the regions with lots of gaps are also those that
(a) tend to have higher rates of change, and
(b) tend to be badly aligned.
So there are some arguments on each side of the issue. A useful
and sophisticated solution to the tree alignment problem would go far to
alleviate these worries. It is a Big Need in computational molecular biology.
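
To make the missing-data point concrete, here is a minimal Python sketch of
why the two treatments give different numbers. It is not how PROTDIST works
internally: it uses the plain proportion of differing sites instead of the
Jones-Taylor-Thornton or Dayhoff PAM model, and the function names and the
toy alignment are made up for illustration. The only assumption carried over
is that a gap is skipped as missing data for the pair being compared.

```python
GAP_CHARS = {"-", "?"}  # assumed symbols for gap / missing data in this sketch


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping any site where either
    sequence has a gap (gap treated as missing data for this pair only)."""
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # missing data: the site contributes nothing
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no sites without gaps to compare")
    return differing / compared


def strip_gapped_columns(seqs):
    """Excise every alignment column in which any sequence has a gap."""
    keep = [i for i in range(len(seqs[0]))
            if all(s[i] not in GAP_CHARS for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]


# Hypothetical toy alignment, three sequences of equal length.
alignment = [
    "MKTAYIAK",
    "MKSAY-AK",
    "MRTA--GK",
]

# Gaps kept: sequences 1 and 2 are compared at 7 sites, 1 of which differs.
with_gaps = p_distance(alignment[0], alignment[1])

# Gapped columns excised from all sequences: only 6 sites remain, 1 differs.
stripped = strip_gapped_columns(alignment)
without_gaps = p_distance(stripped[0], stripped[1])

print(with_gaps, without_gaps)
```

The two distances differ (1/7 versus 1/6) because excising a column also
throws away sites where this particular pair had no gap at all. With a real
substitution model the numbers would be different, but the effect is the same.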
--
Joe Felsenstein joe at removethispart.gs.washington.edu
Department of Genome Sciences, University of Washington,
Box 357730, Seattle, WA 98195-7730 USA