Making alignments

James McInerney jamm at nhm.ac.uk
Mon Jan 19 10:26:24 EST 1998


> I've just looked at a chapter by Nick Goldman:
> Goldman N. Phylogenetic estimation. In: Bishop
> MJ, Rawlings CJ, eds.  DNA and Protein Sequence Analysis.
> A Practical Approach. Oxford: IRL Press, 1997:(Rickwood D,
> Hames BD, eds. The Practical Approach Series; vol 171).
> 
> He writes (p297): 'DNA sequences must contain more information than
> amino acid sequences and phylogenetic estimation methods based
> on DNA are generally better developed than for amino acids.
> Consequently, I recommend the use of DNA sequences whenever the
> choice exists.'
> 

Yes, I read this too, and I thought it was a bizarre thing to say.  Convergence
in base composition (two very distantly related sequences arriving at a similar
base composition) is widespread, and synonymously degenerate third codon
positions very quickly become saturated with substitutions.  Given both, I
cannot recommend the use of DNA sequences when these sequences can be
translated into proteins.  I wish that he had elaborated on this sentence (I
reviewed the book for SGM Quarterly).
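
As a rough way of looking at the base composition point, here is a minimal
Python sketch that reports base frequencies for a whole coding sequence and for
its third codon positions only.  The function names and the toy sequence are
made up for illustration, and it assumes an ungapped, in-frame coding sequence
that starts at codon position 1.

```python
def base_composition(seq):
    """Return the frequency of A, C, G and T, ignoring any other symbols."""
    counted = [b for b in seq.upper() if b in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G or T")
    return {b: counted.count(b) / len(counted) for b in "ACGT"}


def third_positions(cds):
    """Return the third codon positions (assumes the sequence is in frame)."""
    return cds[2::3]


# Toy coding sequence, invented for illustration only.
cds = "ATGGCCAAGCTGTAA"
print(base_composition(cds))
print(base_composition(third_positions(cds)))
```

Comparing the two dictionaries across taxa gives a quick impression of whether
distantly related sequences share a similar composition, especially at third
positions.  It is only a descriptive check, not a statistical test.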

Perhaps if Nick is 'listening' to this thread, he might contribute his opinion.


> Also, concerning indels or gaps, he writes (p 283):
> 'A gap is difficult to interpret in evolutionary terms, and there
> are no reliable methods which can use the information held in patterns
> of gaps. Some phylogenetic analyses are able to use the nucleotides
> or amino acids present at positions where some sequences have gaps;
> others cannot, and those positions must be discarded from the
> data to be analysed, even if only one sequence has a gap.'
> 
> So, should I strip my alignment of gaps? and can anyone recommend
> programs to do this?
> 

The program MEGA (look it up at http://evolution.genetics.washington.edu) has
an option either to remove all columns that contain gaps or to remove gapped
positions only from each particular pairwise comparison.
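
The two treatments of gaps described above can be sketched in a few lines of
Python.  This is not MEGA's code, only an illustration of the idea; the
function names, the use of "-" as the gap symbol, and the toy alignment are
assumptions, and the distance shown is a plain proportion of differences.

```python
def strip_gap_columns(seqs, gap="-"):
    """Complete deletion: drop every column in which any sequence has a gap."""
    keep = [i for i, col in enumerate(zip(*seqs)) if gap not in col]
    return ["".join(s[i] for i in keep) for s in seqs]


def pairwise_p_distance(a, b, gap="-"):
    """Pairwise deletion: skip only columns gapped in one of the two sequences."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no ungapped columns shared by the two sequences")
    return sum(x != y for x, y in pairs) / len(pairs)


# Toy alignment, invented for illustration only.
alignment = ["ACG-TA", "AC-TTA", "ACGTTC"]
print(strip_gap_columns(alignment))
print(pairwise_p_distance(alignment[0], alignment[2]))
```

Complete deletion throws away two of the six columns here, whereas the pairwise
comparison of the first and third sequences loses only one, which is why the
pairwise option keeps more data when gaps are scattered across sequences.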

Joe Felsenstein once sent me an email explaining that DNADIST from the PHYLIP
package does some nifty maximum likelihood estimation of the possible identity
of the nucleotide at a gap position.



> Another question. I've heard that a sequence dataset needs a minimum
> of 20 informative positions to be usable. That is, any position
> where at least one sequence has a different residue. Any comments?
> 

This is the "How long is a piece of string?" question.  Sometimes you can
solve a phylogeny question to your satisfaction with a short alignment and
sometimes it has to be much longer.  There is no simple answer, but the more
data the better usually (although there are instances where having more data
can give you a more _definite_ wrong answer).
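
If you want to count such positions in your own alignment, the following
Python sketch may help.  It counts variable sites using the definition in the
question (at least one sequence differs), and also parsimony-informative sites
using the stricter definition that is commonly applied (at least two states,
each present in at least two sequences).  The function names, the gap symbol
and the toy alignment are assumptions made for illustration.

```python
from collections import Counter


def variable_sites(seqs, gap="-"):
    """Columns with more than one state among the ungapped residues."""
    return [i for i, col in enumerate(zip(*seqs))
            if len({c for c in col if c != gap}) > 1]


def parsimony_informative_sites(seqs, gap="-"):
    """Columns where at least two states each occur in two or more sequences."""
    sites = []
    for i, col in enumerate(zip(*seqs)):
        counts = Counter(c for c in col if c != gap)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            sites.append(i)
    return sites


# Toy alignment, invented for illustration only.
alignment = ["ACGTA", "ACGTA", "ATGAA", "ATGAC"]
print(variable_sites(alignment))
print(parsimony_informative_sites(alignment))
```

In this toy case the last column is variable but not parsimony-informative,
because the differing residue occurs in only one sequence.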

Regards,

James

-- 
=========
James O. McInerney               email: J.mcinerney at nhm.ac.uk
Molec. Biol. Comput. Officer,    phone: +44 171 938 9247
Department of Zoology,           Fax:   +44 171 938 9158
The Natural History Museum,
Cromwell Road,                    
London SW7 5BD.                  
=========



