
DNA vs amino acid sequences

Tom Thatcher ttha at uhura.cc.rochester.edu
Mon Jun 12 08:05:31 EST 1995

In <3rghpv$hm6 at nuscc.nus.sg> mcbbv at leonis.nus.sg (Venkatesh Byrappa) writes:

> I am now in the process of generating a phylogenetic tree of actin
> sequences from some lower vertebrates.  I am wondering which sequence
> I should use - DNA or amino acid? Which is more appropriate for a
> highly conserved protein like the [...]
> Can someone lead me to recent publications that compare the pros and
> cons of using DNA vis-a-vis amino acid sequences for generating
> phylogenetic [...]

I have published trees for tubulin, HMGs, and histones (all from the
lab of M. A. Gorovsky).  In all cases I used the amino acid sequences,
for the following reasons:

(Note that these reasons apply mostly to highly conserved proteins
found in species that diverged long ago--like actin.  For faster-
evolving proteins the rules are probably different.)

1) CODON BIAS:  I was comparing all known sequences, including those from 
   yeast, protozoa, plants, and vertebrates.  It is known that yeasts,
   protozoa, and animals have different codon preferences, which would
   result in differences in DNA sequence related to codon bias and not
   to evolution.  Also, some protozoa (ciliates such as Tetrahymena) use
   the codons TAA and TAG to encode glutamine, rather than STOP.  The
   inclusion of unique codons in a subset of the sequences will tend to
   make that subset appear more divergent than it really is.

2) LONG TIME HORIZON:  I was comparing sequences that have been diverging
   for possibly a billion years.  In that time, it is very likely that the
   wobble bases in the codons will have become randomized.  If you exclude
   the wobble bases, then you are really looking at amino acid sequence.
3) INTRONS:  A DNA sequence comparison should only include coding sequences.
   I decided in the interest of time and sanity that I would not go into
   the DNA sequences and edit out all the introns in every sequence.

4) MULTIGENE FAMILIES: Humans contain who knows how many histone genes,
   but only one peptide sequence for H4 has ever been identified in
   humans.  If you do DNA sequences, then which genes do you include?
   How do you know they are all expressed?  If all the H4 genes that are
   expressed encode the same protein, then are DNA differences significant?

5) PROTEIN IS THE UNIT OF SELECTION: For protein-encoding genes, the object
   on which natural selection acts is the protein itself.  The underlying
   DNA sequence reflects this process in combination with species-specific
   pressures on DNA sequence (like the need for thermophiles to have DNA
   that is resistant to melting).  If function demands that a protein
   maintain a specific sequence, there still is room for the DNA sequence
   to change. (see #1).
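
Points 1 and 2 can be made concrete with a small Python sketch.  It is
only an illustration, not the method used for the published trees.  The
table strings follow the NCBI standard code (translation table 1) and the
ciliate nuclear code (translation table 6); the function names and the
toy sequence are made up for the example.

```python
from itertools import product

BASES = "TCAG"

# Standard genetic code in TCAG order (NCBI translation table 1); '*' is STOP.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}

# Ciliate nuclear code (NCBI translation table 6): TAA and TAG encode
# glutamine (Q); TGA remains a stop codon.
CILIATE = dict(STANDARD, TAA="Q", TAG="Q")


def codons(cds):
    """Split a coding sequence into complete codons, ignoring a ragged end."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def translate(cds, table):
    """Translate a coding sequence with the given codon table."""
    return "".join(table[codon] for codon in codons(cds))


def first_two_positions(cds):
    """Drop the third (wobble) base of every codon."""
    return "".join(codon[:2] for codon in codons(cds))


toy_cds = "ATGTAACAGTGA"  # hypothetical sequence, for illustration only
print(translate(toy_cds, STANDARD))   # read with the standard code
print(translate(toy_cds, CILIATE))    # read with the ciliate code
print(first_two_positions(toy_cds))   # wobble bases removed
```

The same DNA reads as M*Q* under the standard code and MQQ* under the
ciliate code, so a DNA comparison that ignores which code each organism
uses can count differences that are not differences in the protein.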

My recommendation is, if you can, do the trees both ways and see how they
look.  For a group of species that are relatively close in time and closely
related (like all vertebrates) DNA is probably a good way to go, since you
avoid problems 1 and 2.  But check the protein anyway.  Be aware of the
problems of multigene families and be careful when you decide to exclude
or include sequences.
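
One rough way to see how the two approaches differ, before building any
trees, is to compare the fraction of differing sites at each level.  The
following is a minimal sketch, assuming the sequences are already aligned
to the same length.  The function p_distance and the toy sequences are
hypothetical; the toy proteins were translated by hand with the standard
code.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return differing / compared


# Toy coding sequences that differ only at third codon positions.
dna_a = "GCTGGTAAA"  # Ala Gly Lys
dna_b = "GCCGGAAAG"  # Ala Gly Lys
protein_a = "AGK"
protein_b = "AGK"

dna_dist = p_distance(dna_a, dna_b)
protein_dist = p_distance(protein_a, protein_b)
print(f"DNA p-distance:     {dna_dist:.3f}")
print(f"protein p-distance: {protein_dist:.3f}")
```

Here the DNA sequences differ at three of nine sites while the proteins
are identical.  Real tree-building programs use more elaborate distance
corrections; this only shows why the two kinds of tree can disagree.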

If you start broadening your tree to include plants or fungi, or even more
distant groups, protein is probably better.

Tom Thatcher                          | You can give a PC to a Homo habilis,
University of Rochester Cancer Center | and he'll use it, but he'll use it
ttha at uhura.cc.rochester.edu           | to crack nuts.

More information about the Mol-evol mailing list
