Sequence Searching Algorithms

Keith Robison robison1 at husc10.harvard.edu
Mon Oct 5 15:31:53 EST 1992


Someone posted a query recently about how to search for a
defined site (such as a restriction enzyme recognition site)
in a biological sequence, and the one reply I saw mentioned
simple scanning algorithms.  I would just like to point out
that while such algorithms are both easy to write and commonly
used (every searcher I have written has used one), a more elegant
and sometimes faster algorithm is a Finite State Machine.  In brief,
a Finite State Machine (FSM) improves on a simple scanning algorithm by
knowing when it can "skip ahead".

For example, suppose you wish to look for the site "AATTTCCC" in DNA.
Suppose a stretch of DNA contained the sequence "AATTTGG". A simple
scanner would compare (| - match, * - mismatch):

try #1          try #2          try #3

AATTTCCC	AATTTCCC	AATTTCCC
|||||*		|*		*
AATTTGG		ATTTGG		TTTGG


Now, we humans know that once #1 has failed, you can skip past the whole
AATTTGG segment because "ATTTGG" cannot overlap "AATTTCCC" in any way.
An FSM incorporates this idea, and thus can be faster.

If you can find the journal, a very good description of FSMs can be 
found in:

  Tyler, EC, MR Horton, and PR Krause.  1991.  Comp.&Biomed Res. 24:72-96.
	A review of algorithms for molecular sequence comparison.

Another relevant reference is:
   Altschul, SF, W Gish, W Miller, EW Myers, and DJ Lipman.  1990. JMB 214:1-8.
	 Basic local alignment search tool.

 as BLAST uses an FSM in the initial stages of the database search.



Keith Robison
Harvard University
Department of Cellular & Developmental Biology
Department of Genetics / HHMI

robison at ribo.harvard.edu 

P.S. Apologies for not posting this as a follow-up -- I was rushing to
     get a poster ready when the original showed up and the local
     news server has apparently "expired" it.




More information about the Bio-soft mailing list