# Deletions in protein sequence alignments

Jeff Thorne jeff at evolution.u.washington.edu
Sat May 16 16:24:03 EST 1992

```
I'll try to answer your question about assigning gap penalties
but I should admit that my opinion is probably not the majority
one.  I think that sequence alignment should be viewed in
the context of evolution because evolution is the force
responsible for divergence of sequences.  If sequences didn't
evolve, they'd all be the same and you'd have no reason to
align them.

My personal view is that most people (and programs) treat the
problem of assigning penalties to gaps or mismatches in a very ad
hoc fashion.  The penalties should depend on the probability of
an event (i.e. insertion, deletion, substitution).  Specifically,
they should be related to the logarithm of the probability.  The
probability of an event is affected by:

1.  The frequency of the type of event
2.  The amount of time since the sequences had a common ancestor

Obviously, these two things depend on the specific
sequences being studied.   You shouldn't use the same penalty
every time you align sequences.  The penalty set should adapt to
the data being analyzed.  Unfortunately, data adaptive
methods aren't widely used.  They do exist but have not been
popular because of the amount of computation they require.  I
think that's kind of silly.  People are willing to spend months
or years collecting their data but they often don't want to
spend an hour to analyze it.

Hirohisa Kishino, Joe Felsenstein and I have written two papers
about objective sequence analysis.  Our solution is to use a
model of sequence evolution as a basis for likelihood methods
of alignment.  This yields a natural way of assigning penalties
to gaps and to mismatches.  The two papers were in the Journal
of Molecular Evolution (Thorne, Kishino, and Felsenstein 1991
33:114-124; Thorne, Kishino, and Felsenstein 1992 34:3-16).

There are two other data adaptive approaches that I should
mention.  Fitch and Smith (PNAS USA 80:1382-1386, 1983) use
a Monte Carlo approach to assign penalties.  Allison and
Yee (Bull Math Biol 52:431-453) use a Minimum Message Length
approach.

I (not surprisingly) have my own (subjective?) ideas about
the relative merit of each of these three data adaptive approaches.
The important point though is that all three of these approaches
are superior to the usual practice of using the same set of
weights for each set of sequences.

Sorry if this attempt at an answer is less clear and more
long-winded than you wanted (I did try to restrain myself).
I'd have liked to be able to say "Use 5 as the penalty for
starting a gap and use 2 for the penalty for each position
that a gap continues" but that would be an irresponsible
statement.  The truth is that sequence comparison is a primitive
area at the moment.

As far as availability of computer programs, Lloyd Allison and
his co-workers have some software that you can get which uses
their method.  I'm not familiar with its details but I'm sure
it's superior to widely-used methods.  I hope our likelihood
method can be put into a user-friendly format soon.  Probably,
there is somebody who distributes an implementation of the Fitch-
Smith Monte Carlo method but I don't know where you could
get it.

Jeff Thorne
(Favorite Email Address: jeff at amanita.cit.cornell.edu)

```
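
The post argues that a penalty should be related to the logarithm of an event's probability, and that this probability depends on how frequent the event type is and on how long ago the sequences shared an ancestor. A minimal sketch of that idea follows. It is not the likelihood method from the cited papers: the Poisson-process assumption, the function names, and the rate and time values are all hypothetical and chosen only for illustration.

```python
import math


def event_probability(rate: float, time: float) -> float:
    """Probability of at least one event in `time`.

    Assumes events arrive as a Poisson process with the given rate.
    This is an assumption of the sketch, not a model taken from the post.
    """
    return 1.0 - math.exp(-rate * time)


def penalty(probability: float) -> float:
    """Penalty defined as the negative natural log of an event probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(probability)


if __name__ == "__main__":
    # Hypothetical rates per unit time: substitutions assumed more
    # frequent than insertions/deletions.
    rates = {"substitution": 0.10, "indel": 0.01}
    for time in (0.5, 2.0):  # hypothetical divergence times
        for name, rate in rates.items():
            p = event_probability(rate, time)
            print(f"t={time:<4} {name:<13} p={p:.4f} penalty={penalty(p):.3f}")
```

Under these assumptions, rarer event types receive larger penalties, and every penalty shrinks as the divergence time grows, which is why a single fixed penalty set cannot suit every pair of sequences.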