
# how to treat gaps in alignments for distance calculations?

Arlin Stoltzfus arlin at carb.nist.gov
Fri Nov 1 16:52:11 EST 2002

```If anyone wants an objective procedure for treating uncertain regions
in alignments (in the absence of a badly needed statistical
framework for alignment+phylogeny uncertainty), here is one.

First, get a multiple alignment by some objective procedure (see
below *).

Second, use the SOAP (stability of aligned positions) method of
Ari Löytynoja & Michel Milinkovitch
(http://evol-linux1.ulb.ac.be/~aril/SOAP/) to get reliability scores
for each alignment column.  This requires having a set of alternative
alignments, because SOAP takes one reference alignment and a set of
N alignments (typically the reference alignment plus a set of
alternatives), and computes, for each column in the reference
alignment, the frequency with which its juxtaposition of
sequences/positions occurs in the set of alignments.  Each column
thus gets a score ranging from 1/N (unique to the reference alignment)
to 1 (found in all alignments).

Third, use the reliability scores in your analysis.  For instance,
if you are doing parsimony, you can use the alignment reliability
as a character weight.  This is easy with PAUP because Ari Löytynoja
has modified a command-line version of SOAP to produce NEXUS output
with a matrix of character weights that can be read directly into
PAUP.  If you are using some other method of analysis, you could
at least apply a threshold value to exclude unreliable alignment
columns. This would address the kind of problem that Jerry Learn
mentioned, by distinguishing reliable gaps from unreliable gaps.

* To choose the best alignment, we combine the first two steps
in a somewhat tedious process that is made manageable by Perl scripts.
We produce a set of a few dozen multiple alignments with a range
of gap parameters.  Then we subject *each one* to SOAP scoring.
Then we simply choose the alignment with the most reliable columns.
In effect, this is an objective method for choosing gap parameters.

Arlin
--
Arlin Stoltzfus (arlin at carb.nist.gov)
Research Biologist, NIST; Adj. Asst. Prof., UMBI
CARB, 9600 Gudelsky Dr., Rockville, Md 20850
ph. 301 738-6208; fax 301 738-6255; http://www.molevol.org/camel
---

```
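
The column scoring, thresholding, and alignment-selection steps described in the post can be sketched roughly as follows. This is a minimal illustration, not SOAP's actual code. It assumes each alignment is a list of equal-length strings with `-` as the gap character and the sequences in the same order in every alignment. It also assumes that "the most reliable columns" means the largest count of columns at or above a chosen threshold, which is one possible reading. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

# A column is described by which residue (0-based index within its own
# sequence) each row contributes, or None where the row has a gap.
Column = Tuple[Optional[int], ...]


def column_signatures(alignment: List[str]) -> List[Column]:
    """Return one signature per column of the alignment."""
    counters = [0] * len(alignment)  # next residue index for each sequence
    signatures = []
    for col in range(len(alignment[0])):
        sig = []
        for row, seq in enumerate(alignment):
            if seq[col] == "-":
                sig.append(None)
            else:
                sig.append(counters[row])
                counters[row] += 1
        signatures.append(tuple(sig))
    return signatures


def column_reliability(reference: List[str],
                       alignments: List[List[str]]) -> List[float]:
    """Fraction of the N alignments that contain each reference column.

    If the reference is itself in `alignments`, scores range from 1/N to 1.
    """
    column_sets = [set(column_signatures(a)) for a in alignments]
    n = len(column_sets)
    return [sum(sig in cols for cols in column_sets) / n
            for sig in column_signatures(reference)]


def reliable_columns(scores: List[float], threshold: float) -> List[int]:
    """Indices of columns whose score is at or above the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def pick_best_alignment(alignments: List[List[str]], threshold: float) -> int:
    """Index of the alignment with the most columns at or above the threshold
    (assumed selection rule; ties go to the earliest alignment)."""
    counts = [len(reliable_columns(column_reliability(a, alignments), threshold))
              for a in alignments]
    return max(range(len(alignments)), key=counts.__getitem__)


if __name__ == "__main__":
    # Toy alignments standing in for runs with different gap parameters.
    runs = [["AC-GT", "ACTGT"],
            ["ACG-T", "ACTGT"],
            ["AC-GT", "ACTGT"]]
    scores = column_reliability(runs[0], runs)
    print(scores)
    print(reliable_columns(scores, 0.5))
    print(pick_best_alignment(runs, 0.5))
```

The per-column scores could be used directly as character weights for parsimony, or passed through `reliable_columns` to exclude unreliable columns before other kinds of analysis.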