how to treat gaps in alignments for distance calculations?

Arlin Stoltzfus arlin at carb.nist.gov
Fri Nov 1 16:52:11 EST 2002

If anyone wants an objective procedure for treating uncertain regions
in alignments (in the absence of a badly needed statistical
framework for alignment+phylogeny uncertainty), here is one.

First, get a multiple alignment by some objective procedure (see
below *).

Second, use the SOAP (stability of aligned positions) method of
Ari Löytynoja & Michel Milinkovitch
(http://evol-linux1.ulb.ac.be/~aril/SOAP/) to get reliability scores
for each alignment column.  This requires having a set of alternative
alignments, because SOAP takes 1 reference alignment and a set of
N alignments (typically the reference alignment plus a set of
alternatives), and computes, for each column in the reference
alignment, the frequency with which its juxtaposition of
sequences/positions occurs in the set of alignments.  Each column
thus gets a score ranging from 1/N (unique to the reference alignment)
to 1 (found in all alignments).
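
To make the column score concrete, here is a minimal Python sketch of
the idea as described above.  It is not SOAP's own code, and the
function names and the input format (each alignment as a list of
equal-length gapped strings, with sequences in the same order and "-"
as the gap character) are assumptions made for illustration.

```python
def column_signatures(alignment, gap="-"):
    """Describe each column by which residue of each sequence it contains.

    A signature is a tuple with one entry per sequence: the 0-based index
    of that sequence's residue in the column, or None if it has a gap.
    """
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    counters = [0] * len(alignment)   # residues seen so far in each sequence
    signatures = []
    for col in range(lengths.pop()):
        sig = []
        for i, row in enumerate(alignment):
            if row[col] == gap:
                sig.append(None)
            else:
                sig.append(counters[i])
                counters[i] += 1
        signatures.append(tuple(sig))
    return signatures


def column_reliability(reference, alignments):
    """For each reference column, the fraction of alignments containing it."""
    ref_sigs = column_signatures(reference)
    alt_sets = [set(column_signatures(a)) for a in alignments]
    n = len(alt_sets)
    return [sum(sig in s for s in alt_sets) / n for sig in ref_sigs]


reference = ["AC-GT", "ACGGT"]
alternative = ["ACG-T", "ACGGT"]
scores = column_reliability(reference, [reference, alternative])
print(scores)
```

With the reference included in the set of N = 2 alignments, the two
columns that differ between the alignments score 1/2 and the rest
score 1.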

Third, use the reliability scores in your analysis.  For instance,
if you are doing parsimony, you can use the alignment reliability
as a character weight.  This is easy with PAUP because Ari Löytynoja
has modified a command-line version of SOAP to produce NEXUS output
with a matrix of character weights that can be read directly into
PAUP.  If you are using some other method of analysis, you could
at least apply a threshold value to exclude unreliable alignment
columns. This would address the kind of problem that Jerry Learn
mentioned, by distinguishing reliable gaps from unreliable gaps.
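
Both uses of the scores can be sketched in a few lines of Python.
This is only an illustration: the function names, the 0.75 threshold
and the scaling of weights to integers from 0 to 10 are assumptions,
and the sketch does not reproduce the NEXUS output of the modified
SOAP.

```python
def reliable_columns(scores, threshold):
    """Indices of columns whose reliability is at least the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def filter_alignment(alignment, keep):
    """Keep only the listed columns of each aligned sequence."""
    return ["".join(row[i] for i in keep) for row in alignment]


def integer_weights(scores, scale=10):
    """Scale reliabilities in [0, 1] to integer character weights."""
    return [round(s * scale) for s in scores]


alignment = ["AC-GT", "ACGGT"]
scores = [1.0, 1.0, 0.5, 0.5, 1.0]   # hypothetical per-column reliabilities

keep = reliable_columns(scores, threshold=0.75)
trimmed = filter_alignment(alignment, keep)
weights = integer_weights(scores)
print(keep, trimmed, weights)
```

Thresholding discards the unreliable columns outright, whereas
weighting keeps them but lets them count for less.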

* To choose the best alignment, we combine the first two steps
in a somewhat tedious process that is made manageable by Perl scripts.
We produce a set of a few dozen multiple alignments with a range
of gap parameters.  Then we subject *each one* to SOAP scoring.
Then we simply choose the alignment with the most reliable columns.
In effect, this is an objective method for choosing gap parameters.
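
The selection step might look like the Python sketch below.  It
assumes that "most reliable columns" means the largest number of
columns scoring at or above a threshold (here 1.0, i.e. found in every
alignment); that reading, the function names and the gap-parameter
labels are all assumptions for illustration, and this is not our Perl
code.

```python
def count_reliable(scores, threshold=1.0):
    """Number of columns with reliability at or above the threshold."""
    return sum(1 for s in scores if s >= threshold)


def choose_alignment(scores_by_label, threshold=1.0):
    """Label of the alignment with the most reliable columns.

    scores_by_label maps a label (e.g. a gap-parameter setting) to that
    alignment's list of per-column reliability scores.  On a tie, the
    label that comes first in the mapping is returned.
    """
    if not scores_by_label:
        raise ValueError("no alignments to choose from")
    return max(
        scores_by_label,
        key=lambda label: count_reliable(scores_by_label[label], threshold),
    )


# Hypothetical scores for three alignments made with different gap penalties.
scores_by_label = {
    "gapopen=5": [1.0, 0.5, 0.5],
    "gapopen=10": [1.0, 1.0, 0.5],
    "gapopen=15": [0.25, 1.0, 0.5],
}
best = choose_alignment(scores_by_label)
print(best)
```

Other summaries, such as the mean column score, would be equally easy
to substitute in count_reliable.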

Arlin Stoltzfus (arlin at carb.nist.gov)
 Research Biologist, NIST; Adj. Asst. Prof., UMBI
CARB, 9600 Gudelsky Dr., Rockville, Md 20850
ph. 301 738-6208; fax 301 738-6255; http://www.molevol.org/camel

