Question about multiple sequence alignment
btf at lanl.gov
Fri Jul 23 21:54:15 EST 1999
David Jones wrote:
> Robert C Oehmke <oehmke at engin.umich.edu> wrote:
> > Now, given that my program takes around 6 hours to run on
> > a parallel computer, does it seem useful to be able to compare
> > sequences exactly for this range of number of sequences and length of
> > protein, or are the approximation methods good enough to render the extra
> > work useless?
Six hours to align 6 sequences of length 200 seems
too slow to be useful to me. I typically need to align 200
to 400 sequences of length 2,000 or so.
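
For a sense of scale: a full N-way dynamic programming alignment has one lattice cell per combination of positions, so the work grows as the product of the sequence lengths. The sketch below only counts cells. It assumes no pruning at all (a real program may use bounds to skip much of the lattice), and the function name `full_dp_cells` is illustrative, not taken from any program discussed here.

```python
from math import prod

def full_dp_cells(lengths):
    """Count lattice cells in an unpruned N-way dynamic programming alignment.

    Each sequence of length n contributes n + 1 positions (including the
    empty prefix), and the lattice is the product over all sequences.
    """
    return prod(n + 1 for n in lengths)

six_short = full_dp_cells([200] * 6)      # 6 sequences of length 200
pairwise = full_dp_cells([2000] * 2)      # 2 sequences of length 2,000
many_long = full_dp_cells([2000] * 200)   # 200 sequences of length 2,000

print(f"6 x 200:    {six_short:.2e} cells")
print(f"2 x 2000:   {pairwise:.2e} cells")
print(f"200 x 2000: a number with {len(str(many_long))} digits")
```

Under these assumptions, adding sequences is far more expensive than adding length, which is why exact methods stay confined to a handful of sequences.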
If I am aligning HIV-1 sequences to HIV-1 sequences,
the gaps and mismatches are relatively trivial, and the alignment
is quite believable.
When I attempt to align HIV-1 sequences to SIVs
(primate immunodeficiency viruses) and to other lentiviruses
such as equine infectious anemia virus or visna virus,
I find that the job requires biological knowledge, and not
just simple scoring schemes. The RT region of the polymerase
gene is easily aligned as it is a protein-coding region
encoding a protein with many constraints on its evolution.
Small regions of the LTR (such as the AATAAA poly-A signal)
align well, while the majority of the LTR seems to have
been totally randomized by thousands of years of evolution.
> > Note, that the program is not fixed at a size of 6 and 200
> > the number of sequences could also be adjusted down to vastly increase the
> > length of the protein.
> At the end of the day, no matter whether you use a rigorous MSA
> method or an approximation, the alignments are still not
> going to be "biologically optimal" unless the sequences are closely
> related.
I agree. But the definition of "closely related" is
open to debate.
> The additional rigour obtained from using a full N-way dynamic
> programming method is going to be swamped by the inadequacies of the
> amino acid substitution scoring scheme at the end of the day.
Again, I agree. What we really need are programs that
use site-specific scoring and a model of the evolution of the
particular region being aligned. I've had good luck with
Sean Eddy's HMMER hidden Markov model software, using an iterative
procedure of aligning with CLUSTAL, adjusting by hand, building
a model, aligning using the model, adjusting by hand, building
a new model, etc...
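
To make "site-specific scoring" concrete, here is a minimal sketch of a position-specific scoring matrix built from an existing alignment. It is not how HMMER works internally (a profile HMM also models insertions and deletions); it only shows how each column can carry its own scores. The function names, the pseudocount of 1, and the uniform background frequencies are assumptions made for the illustration.

```python
from collections import Counter
from math import log

def build_profile(alignment, alphabet="ACGT", pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column.

    Gap characters and anything outside the alphabet are ignored when
    counting. A uniform background frequency is assumed.
    """
    background = 1.0 / len(alphabet)
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({
            a: log(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return profile

def score(profile, seq):
    """Sum the column scores for a sequence already placed on the columns."""
    return sum(col.get(ch, 0.0) for col, ch in zip(profile, seq))

# Toy alignment of a poly-A signal with one variable position.
alignment = ["AATAAA", "AATAAA", "AATAAA", "AGTAAA"]
profile = build_profile(alignment)

for candidate in ("AATAAA", "AGTAAA", "CCGCGC"):
    print(candidate, round(score(profile, candidate), 2))
```

A mismatch in a fully conserved column costs more than the same mismatch in a variable column, which is the behaviour a single substitution matrix cannot provide.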
I recently began testing DIALIGN (see ref below)
which takes DNA sequences and translates them in all 3
frames and looks for diagonals in the protein sequences to
help align the DNA. With 44 sequences of length 6,000 it
ground along for over 500 hours on a 300MHz machine and I
finally had to kill the job to free up that computer. So
I don't know if it would have produced a nice alignment or not.
> The bottom line is that both methods are going to produce alignments which
> are not biologically correct - so why not use the faster approximate method?
I would really like to get good alignments built.
It would help us to understand where HIV-1 and HIV-2 came from
and perhaps give insight into where they might be going.
They evolve so fast that we really want to get a handle on
which regions are conserved and likely to remain conserved
in the future, so that those regions can be targeted for
> On a more practical note, we are now facing situations where we now have
> to produce good multiple alignments for hundreds or even thousands of
> sequences. Even the faster approximate MSA programs take a long time to
> align this many sequences.
I have found that the speed is not only related to the
number and length of sequences, but also the divergence
and type of divergence. Regions that undergo lots of insertions
and deletions are harder than regions that change in sequence
but not length. Tandem duplications are a special problem, and
tend to happen quite often.
Sequencing errors (including errors that arise during
PCR amplification and cloning) are also a problem when they
introduce frameshifts. They make it very difficult to automate
the chain from DNA, to translated protein, to aligned proteins,
and back to aligned DNA.
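
The DNA to protein to aligned DNA round trip can be sketched as follows. This is a minimal illustration assuming the standard genetic code and a known reading frame; the function names `translate` and `thread_dna` are made up for the example and do not come from any of the programs mentioned.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(dna, frame=0):
    """Translate one reading frame (0, 1 or 2).

    A trailing partial codon is dropped; codons containing ambiguous
    bases are reported as 'X'.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)

def thread_dna(aligned_protein, dna, frame=0):
    """Map a gapped protein row back onto its DNA, three columns per residue."""
    ungapped = aligned_protein.replace("-", "")
    if translate(dna, frame)[:len(ungapped)] != ungapped:
        raise ValueError("DNA does not translate to the aligned protein")
    out, pos = [], frame
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")            # a protein gap spans one codon
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    return "".join(out)

dna = "ATGGCCAAATGA"
print(translate(dna))                 # MAK*
print(thread_dna("MA-K*", dna))       # ATGGCC---AAATGA
```

A single-base deletion (for example `ATGGCAAATGA`) changes every downstream codon, so the consistency check raises `ValueError`. That is exactly the case where an automated pipeline stalls and someone has to locate the frameshift by hand.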
Any suggestions on improved methods are welcome.
> This message was written, produced and executively directed by Dr David Jones
> Address: Dept. of Biological | Email: jones at globin.bio.warwick.ac.uk
> Sciences, University of Warwick, | Tel: +44 1203 523729
> Coventry CV4 7AL, U.K. | Fax: +44 1203 523568
Genomatix Software GmbH
Tel +49-89 5490839-0
FAX +49-89 5490839-9
email genomatix at gsf.de
|Brian T. Foley btf at t10.lanl.gov |
|HIV Database (505) 665-1970 |
|Los Alamos National Lab http://hiv-web.lanl.gov/index.html |
|Los Alamos, NM 87544 U.S.A. |