Superfamily Phylogenetics

higgins at embl-heidelberg.de
Thu Apr 8 09:49:37 EST 1993

In article <16BA913551.MCKMICP at YaleVM.YCC.Yale.Edu>, 
MCKMICP at YaleVM.YCC.Yale.Edu (Michael McKenna) writes:

> I would like to generate a phylogenetic tree of the lipocalin
> superfamily. These sequences are very distantly related, if at
> all, but share a number of common general characteristics, including
> a signal peptide, several pairs of cysteine residues, and a 
> molecular weight between 15 and 20 kD. ............

> ........ My question is: Is it possible to generate a phylogeny from
> sequence data when only dubious connections can be made with various 
> alignment algorithms? Most of the programs I have seen can do a 
> reasonable job with clearly related molecules. I suspect it can't
> be done reasonably in this particular case. Any suggestions?

My experience is that it is certainly possible technically but the results
may not be very reliable.  If you do not have enough information to align
the sequences comfortably, building trees is usually even more difficult.  Finding
close groupings will not be a problem but the deep branches may be
unreliable.

I have generated trees where the identity levels dropped below 10 percent for 
the most divergent pairs.  The trees were useful as long as I did not try to
over-interpret the deepest branches.  To get such trees you need very high-quality
alignments (i.e. EVEN better than you get from Clustal :-)).  These have to be
made with reference to structures if they are available.  Usually structures
are not available but you may still get parts of the sequences aligned well
by trying to match the more obvious looking secondary structure elements.
This cannot yet be done automatically.   If you are lucky, you will find 
"blocks" of conserved segments with very few gaps, separated by regions that
are totally ambiguous.  These ambiguous pieces must be removed.  Some parts of
homologous proteins are simply unalignable from primary sequence information
alone.  You can guess at the alignment in these difficult parts using an
"algorithm" but the guess may not mean anything biologically.  If you use these
badly guessed pieces, then the tree topology may depend only on how the guess
was made.
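
As a rough illustration of keeping only the conserved blocks, here is a minimal
Python sketch.  The function name keep_blocks, the toy sequences and the block
coordinates are all made up for this example; in practice the block boundaries
would come from inspecting the alignment by eye (or against structures), not
from any automatic rule.

```python
def keep_blocks(alignment, blocks):
    """Concatenate the chosen column ranges of every aligned sequence.

    alignment: dict mapping name -> aligned sequence ('-' marks a gap)
    blocks:    list of (start, end) column ranges, 0-based, end exclusive
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return {
        name: "".join(seq[start:end] for start, end in blocks)
        for name, seq in alignment.items()
    }


# Hypothetical toy alignment: two well-conserved blocks (columns 0-3 and 8-11)
# separated by an ambiguous, gappy stretch (columns 4-7).
toy = {
    "seqA": "GAWYVL--KDCE",
    "seqB": "GSWYAITPRDCQ",
    "seqC": "GAWF-LS-KECE",
}

trimmed = keep_blocks(toy, [(0, 4), (8, 12)])
for name, seq in trimmed.items():
    print(name, seq)
```

Only the columns inside the listed blocks survive; the ambiguous middle stretch
never reaches the tree-building step.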

A further problem is how to treat gaps (insertions and deletions).  I have seen
many cases where people include gaps in difficult alignments and score them as 
characters (for parsimony or distances).   You may end up with the effect of the
gaps completely outweighing the aligned residues, in determining the topology 
of the final tree.  If the alignment was derived manually, then, in effect, you are 
also manufacturing the tree topology manually.  One drastic but clean solution
is to remove all sites where any sequence has a gap.  This may throw
away half your data though.
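
The "remove every gapped site" approach can be sketched in a few lines of
Python.  This is only an illustration: the function name strip_gapped_columns
and the toy alignment are invented here, and treating both '-' and '.' as gap
characters is an assumption about the input format.

```python
def strip_gapped_columns(alignment, gap_chars="-."):
    """Drop every alignment column in which at least one sequence has a gap.

    alignment: dict mapping name -> aligned sequence (all the same length)
    Returns a new dict with the gapped columns removed from every sequence.
    """
    names = list(alignment)
    if len({len(alignment[n]) for n in names}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*rows) walks the alignment column by column.
    columns = zip(*(alignment[n] for n in names))
    kept = [col for col in columns if not any(c in gap_chars for c in col)]
    return {n: "".join(col[i] for col in kept) for i, n in enumerate(names)}


# Hypothetical toy alignment, for illustration only.
toy = {
    "seqA": "GAWYVL--KDCE",
    "seqB": "GSWYAITPRDCQ",
    "seqC": "GAWF-LS-KECE",
}

ungapped = strip_gapped_columns(toy)
n_before = len(toy["seqA"])
n_after = len(ungapped["seqA"])
print(f"kept {n_after} of {n_before} columns")
for name, seq in ungapped.items():
    print(name, seq)
```

Reporting how many columns survive is worth doing routinely, since it shows at
a glance how much data the gap removal has cost.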

If you do manage to generate a multiple alignment with enough conserved blocks
and remove the nasty bits, actually generating a topology is the easy part
(e.g. using bits of PHYLIP or PAUP).  Neighbor-Joining trees from distances are
fast and you can bootstrap them easily but beware that you cannot use the 
usual corrections for "multiple hits" on the distances if any of the
sequence pairs are less than about 18% identical (over the aligned regions).   
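
Before applying a distance correction it is easy to check which pairs fall
below that identity level.  The Python sketch below is one way to do it; the
function names and toy sequences are hypothetical, the 18% default is simply
the rough figure quoted above, and counting identity only over columns where
neither sequence has a gap is an assumption about what "over the aligned
regions" should mean.

```python
from itertools import combinations


def percent_identity(a, b, gap_chars="-."):
    """Percent identical residues over columns where neither sequence is gapped."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in gap_chars and y not in gap_chars]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def low_identity_pairs(alignment, threshold=18.0):
    """List (name1, name2, identity) for every pair below the threshold."""
    flagged = []
    for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
        ident = percent_identity(s1, s2)
        if ident < threshold:
            flagged.append((n1, n2, ident))
    return flagged


# Hypothetical toy alignment (already trimmed to ungapped blocks).
toy = {
    "seqA": "GAWYKDCE",
    "seqB": "GSWYRDCQ",
    "seqC": "PTLHRNME",
}

flagged = low_identity_pairs(toy)
for n1, n2, ident in flagged:
    print(f"{n1} vs {n2}: {ident:.1f}% identical, too divergent to correct")
```

Any pair that shows up in the flagged list is a warning that corrected
distances involving it should not be trusted.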

Des Higgins
EMBL, Heidelberg, Germany.
