Massive Multiple Sequence Alignment tools?

Sean Eddy eddy at wol.wustl.edu
Tue May 7 07:14:42 EST 1996


In article <4me5tb$1vi at swen.emba.uvm.edu> brianf at med.uvm.edu (Brian Foley) writes:
  >Thank you very much for ideas on tools such as AMPS.
  >In addition to performing an alignment on sequences, once 
  >I have all of the sequences ready for alignment, I was
  >hoping to find a tool that would use the information obtained
  >in a BLAST or FASTA run to help me obtain the sequences and
  >clip out the region with high similarity to my query sequence.

You might check out hidden Markov model software. I know of two
publicly distributed packages: SAM from UC Santa Cruz
(http://www.cse.ucsc.edu/research/compbio/sam.html) and my own HMMER
from Washington University
(http://genome.wustl.edu/eddy/hmmer.html).

HMM multiple alignment algorithms are O(N) instead of O(N^2) in the
number of sequences, so they are much more efficient for huge sequence
sets. They also (in my hands) tend to be more accurate than other
popular methods for large sequence sets, though for more reasonable
numbers of sequences (10-50) I still prefer Clustal. We've aligned
sets as large as 2000+ sequences. Your 6000 would pose no problem.
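
To put rough numbers on the O(N) versus O(N^2) difference, here is a
small Python sketch. It assumes the quadratic cost comes from comparing
every pair of sequences and the linear cost from aligning each sequence
once to a model. It ignores the cost of each individual alignment and
any training iterations, and the function names are only illustrative.

```python
def pairwise_comparisons(n):
    """Pairs to score if every sequence is compared with every other."""
    return n * (n - 1) // 2


def model_alignments(n):
    """Alignments needed if each sequence is aligned once to one model."""
    return n


if __name__ == "__main__":
    for n in (50, 2000, 6000):
        print(n, pairwise_comparisons(n), model_alignments(n))
```

For 6000 sequences that is 17,997,000 pairs versus 6000
sequence-to-model alignments.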

HMMs can also allow you to align to a previous smaller multiple
alignment. You can carefully hand-craft an alignment of a
representative set of sequences, then align the rest of your 6000
relative to that.

You can use an HMM built from your alignment to search for matches in
other sequences. HMMER includes four different search algorithms: one
for complete global alignment; one for Smith/Waterman local alignment;
one for finding complete matches to the HMM in longer sequences (say,
if you're trying to find several complete copies of immunoglobulin
domains in a neural cell adhesion molecule sequence); and one for
finding multiple non-overlapping Smith/Waterman local alignments.
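
For readers unfamiliar with the local/global distinction, here is a
minimal Python sketch of Smith/Waterman local alignment scoring between
two plain sequences. This is not HMMER's code and it is not a profile
HMM search; it is the ordinary sequence-to-sequence recurrence with a
linear gap cost, and the match, mismatch and gap values are arbitrary
assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not HMMER parameters.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # extend with a match or mismatch
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

The floor at zero is what makes the alignment local: flanking sequence
that does not match costs nothing, whereas a global alignment would be
penalized for it.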

We often use HMMER for tasks like the one you seem to be describing. Make
a small seed alignment by hand, by ClustalW, or by HMM training.
Build an HMM. Search a bunch of extracted GenBank sequences with that
HMM. Use a Perl script to parse out the coordinates of the matching
subsequences and fetch the subsequences from GenBank. Make a new
alignment and a new HMM. Iterate if necessary until you've got all
the sequences you want.
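
The parse-and-clip step can be sketched as follows. We use Perl for
this; the sketch below is in Python and makes several assumptions: the
hit table format (whitespace-separated id, start, end, score, with
1-based inclusive coordinates) is hypothetical and is not HMMER's
actual output format, and the sequences are held in a dictionary rather
than fetched from GenBank. All function names are illustrative.

```python
def parse_hits(text, min_score=0.0):
    """Parse hypothetical 'id start end score' lines into tuples."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        seq_id, start, end, score = line.split()[:4]
        if float(score) >= min_score:
            hits.append((seq_id, int(start), int(end), float(score)))
    return hits


def clip(sequences, hits):
    """Cut out each matching region (1-based, inclusive coordinates)."""
    clipped = {}
    for seq_id, start, end, _score in hits:
        name = f"{seq_id}/{start}-{end}"
        clipped[name] = sequences[seq_id][start - 1:end]
    return clipped


def to_fasta(records):
    """Format clipped regions as FASTA text for the next alignment round."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


if __name__ == "__main__":
    seqs = {"seqA": "MKTAYIAKQRQISFVK"}  # made-up example sequence
    table = "# id start end score\nseqA 3 6 25.1\nseqA 10 12 -3.0\n"
    print(to_fasta(clip(seqs, parse_hits(table))), end="")
```

The clipped FASTA output would then feed the next round of alignment
and HMM building.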

-- 
- Sean Eddy
- Dept. of Genetics, Washington University School of Medicine
- 660 S. Euclid Box 8232, St. Louis MO 63110, USA 
- mailto://eddy@genetics.wustl.edu http://genome.wustl.edu/eddy




More information about the Mol-evol mailing list