Sequence Searching Algorithms
robison1 at husc10.harvard.edu
Mon Oct 5 15:31:53 EST 1992
Someone posted a query recently about how to search for a
defined site (such as a restriction enzyme recognition site)
in a biological sequence, and the one reply I saw mentioned
simple scanning algorithms. I would just like to point out
that while such algorithms are both easy to write and commonly
used (every searcher I have written has used one), a more elegant
and sometimes faster algorithm is a Finite State Machine. In brief,
a Finite State Machine (FSM) improves on a simple scanning algorithm by
knowing when it can "skip ahead".
For example, suppose you wish to look for the site "AATTTCCC" in DNA.
Suppose a stretch of DNA contained the sequence "AATTTGG". A simple
scanner would compare (| - match, * - mismatch):
try #1 try #2 try #3
AATTTCCC AATTTCCC AATTTCCC
|||||* |* *
AATTTGG ATTTGG TTTGG
Now, we humans know that once #1 has failed, you can skip past the whole
AATTTGG segment because "ATTTGG" cannot overlap "AATTTCCC" in any way.
An FSM incorporates this idea, and thus can be faster.
If you can find the journal, a very good description of FSMs can be
Tyler, EC, MR Horton, and PR Krause. 1991. Comp.&Biomed Res. 24:72-96.
A review of algorithms for molecular sequence comparison.
Another relevant reference is:
Altschul, SF, W Gish, W Miller, EW Myers, and DJ Lipman. 1990. JMB 214:1-8.
Basic local alignment search tool.
as BLAST uses an FSM in the initial stages of the database search.
Department of Cellular & Developmental Biology
Department of Genetics / HHMI
robison at ribo.harvard.edu
P.S. Apologies for not posting this as a follow-up -- I was rushing to
get a poster ready when the original showed up and the local
news server has apparently "expired" it.
More information about the Bio-soft