Nucleotide sequence simulation

Anders Gorm Pedersen gorm at cbs.dtu.dk
Mon Jul 9 03:30:48 EST 2001

Andrew Rambaut wrote:

> Anders Gorm Pedersen wrote:
>> I think you might find Hidden Markov Models (HMMs) to be useful for this
>> kind of thing. Briefly this type of model can be estimated ("trained") on
>> a set of aligned sequences, and then used in "generative mode" to produce
>> sequences having the same characteristics as the aligned set. I've [...]
> How do these models deal with phylogeny? Does the model estimate the
> phylogenetic relationships between the sequences or assume independent
> lineages (or some sort of pair-wise relationship)?

In their simplest form, hidden Markov models don't deal with the phylogeny
at all, but rely on the unbiased (?) information that can be extracted by
"training" the model on the alignment. Depending on how one constructs the
model, this may include nucleotide frequencies, dinucleotide frequencies,
trinucleotide frequencies, etc. (HMMs can be used for protein sequences as
well as DNA sequences). They can also model site-specific rates of indels
and substitutions, take codon structure into account, and many other
things.
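
To make the "train, then generate" idea concrete, here is a minimal sketch
in Python. It is deliberately much simpler than a real profile HMM: it
assumes a gap-free alignment, has no insert or delete states, and treats
every column independently, so it only captures site-specific nucleotide
frequencies. The function names, the toy alignment and the pseudocount of
1.0 are all made up for illustration.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def train_columns(alignment, pseudocount=1.0):
    """Estimate per-column emission probabilities from a gap-free alignment.

    Assumption: all sequences have equal length and contain only A, C, G, T.
    The pseudocount keeps unseen bases from getting probability zero.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    model = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        total = sum(counts[b] + pseudocount for b in ALPHABET)
        model.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return model

def generate(model, rng):
    """Sample one sequence column by column ("generative mode")."""
    return "".join(
        rng.choices(ALPHABET, weights=[col[b] for b in ALPHABET])[0]
        for col in model
    )

# Hypothetical toy alignment, for illustration only.
toy_alignment = ["ACGTAC", "ACGTTC", "ACGAAC", "TCGTAC"]
model = train_columns(toy_alignment)
rng = random.Random(0)
samples = [generate(model, rng) for _ in range(5)]
```

The sampled sequences follow the column-wise composition of the training
set. Nothing in this sketch knows about the tree relating the sequences,
which is the point made above.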

It is also possible to hardwire prior information into an HMM (average 
transition/transversion rates for instance) and possibly then refine this 
in a site-dependent manner by training on an alignment of the gene family 
being investigated.
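
One common way to do this kind of hardwiring is to express the prior as
pseudocounts that are added to the counts observed in the alignment. The
Python sketch below is only an illustration of that idea, not a description
of any particular HMM package: the parameters p_match, kappa and strength
are hypothetical, and kappa is used here simply as "the one transition gets
kappa times the prior weight of each of the two transversions".

```python
ALPHABET = "ACGT"
# Transitions are purine<->purine (A<->G) and pyrimidine<->pyrimidine (C<->T).
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}

def prior_pseudocounts(consensus, p_match=0.7, kappa=2.0, strength=4.0):
    """Pseudocounts for one column, centred on an assumed consensus base.

    p_match  : prior probability of seeing the consensus base (assumed value)
    kappa    : weight of the transition relative to each transversion
    strength : total number of pseudocounts, i.e. how much the prior counts
    """
    mismatch = 1.0 - p_match
    pseudo = {}
    for base in ALPHABET:
        if base == consensus:
            prob = p_match
        elif base == TRANSITION_PARTNER[consensus]:
            prob = mismatch * kappa / (kappa + 2.0)
        else:
            prob = mismatch * 1.0 / (kappa + 2.0)
        pseudo[base] = strength * prob
    return pseudo

def posterior_mean(observed, pseudo):
    """Combine observed column counts with the prior pseudocounts."""
    total = sum(observed.get(b, 0) + pseudo[b] for b in ALPHABET)
    return {b: (observed.get(b, 0) + pseudo[b]) / total for b in ALPHABET}

pseudo = prior_pseudocounts("A")
# Hypothetical column counts from an alignment of ten sequences.
column = posterior_mean({"A": 8, "G": 2}, pseudo)
```

With few sequences the prior dominates; as the alignment grows, the observed
counts take over, which gives the site-dependent refinement described above.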

Of course, you don't want to make your model overly complicated by having
too many parameters (that would defeat the purpose of modelling in the
first place), or at least not more parameters than the size of your data
set supports.
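
As a rough back-of-the-envelope check, one can count free parameters and
compare them with the number of observed residues. The sketch below assumes
a position-specific Markov chain of a given order over a 4-letter alphabet
(order 0 = nucleotide frequencies, order 1 = dinucleotide dependence, order
2 = trinucleotide dependence). The function name and the example numbers
are made up for illustration.

```python
def free_parameters(order, columns=1, alphabet_size=4):
    """Free emission parameters of an order-k Markov chain.

    Each of the alphabet_size**order contexts has a distribution over
    alphabet_size symbols, which sums to one and so has
    alphabet_size - 1 free parameters. With position-specific
    distributions this is multiplied by the number of columns.
    """
    contexts = alphabet_size ** order
    return columns * contexts * (alphabet_size - 1)

# Hypothetical data set: 20 aligned sequences of 300 columns.
n_sequences, n_columns = 20, 300
observations = n_sequences * n_columns
for order in (0, 1, 2):
    params = free_parameters(order, columns=n_columns)
    print(order, params, observations)
```

In this made-up case the order-2 position-specific model already has more
free parameters than there are residues in the alignment, so the data could
not be expected to support it.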

As mentioned, good starting points for learning about HMMs can be found on 
the website of my colleague Anders Krogh:


A good introduction is:

A. Krogh 1998. An Introduction to Hidden Markov Models for Biological 
Sequences, In S. L. Salzberg et al., eds., Computational Methods in 
Molecular Biology, 45-63.  Elsevier. 

A few refs about evolution and HMMs:

Felsenstein J, Churchill GA., Mol Biol Evol 1996 Jan;13(1):93-104
A Hidden Markov Model approach to variation among sites in rate of 

von Haeseler A, Schoniger M., J Comput Biol 1998 Spring;5(1):149-63
Evolution of DNA or amino acid sequences with dependent sites.

Schadt EE, Sinsheimer JS, Lange K., Genome Res 1998 Mar;8(3):222-33
Computational advances in maximum likelihood methods for molecular 

Mitchison GJ., J Mol Evol 1999 Jul;49(1):11-22
A probabilistic treatment of phylogeny and sequence alignment.

McGuire G, Wright F, Prentice MJ., J Comput Biol 2000 Feb-Apr;7(1-2):159-70
A Bayesian model for detecting past recombination events in DNA multiple 

Anders Gorm Pedersen, Ph.D.  
Center for Biological Sequence Analysis, www.cbs.dtu.dk
Technical University of Denmark
