Good Day folks -
Des Higgins' reply to McKenna [Message-Id: <9304081426.AA20368 at net.bio.net>]
concerning the construction of molecular phylogenies with "low homology"
sequences was SO BEAUTIFUL that it ought to be framed! How often have I
repeated exactly the same arguments to my students and colleagues!? So often
they want the one best (simple 8^) automated) way, the "right way," to perform
a phylogenetic analysis, when, as Des so clearly describes, the most realistic
analysis will usually involve a complex interplay of methods and MUCH reliance
on your own subjective BIOLOGICAL KNOWLEDGE. Where have I heard this before?
It rings so true. His note will become required reading in our course. I'm
enclosing a copy of Des' reply for those of you who missed it the first time
around.
Have a good one, Steve T
Steven M. Thompson
Consultant in Molecular Genetics and Sequence Analysis
VADMS (Visualization, Analysis & Design in the Molecular Sciences) Laboratory
Washington State University, Pullman, WA 99164-1224, USA
AT&Tnet: (509) 335-0533 or 335-3179 FAX: (509) 335-0540
BITnet: THOMPSON at WSUVMS1 or STEVET at WSUVM1
INTERnet: THOMPSON at wsuvms1.csc.wsu.edu
Des' reply ======================================
............
>My experience is that it is certainly possible technically but the results
>may not be very reliable. If you do not have enough information to align
>sequences comfortably, trees are usually even more difficult. Finding
>close groupings will not be a problem but the deep branches may be
>meaningless.
>
>I have generated trees where the identity levels dropped below 10 percent for
>the most divergent pairs. The trees were useful as long as I did not try to
>over-interpret the deepest branches. To get the trees you need very high
>quality alignments (i.e. EVEN better than you get from clustal :-)). These
>have to be made with reference to structures if they are available. Usually
>structures are not available but you may still get parts of the sequences
>aligned well by trying to match the more obvious looking secondary structure
>elements. This cannot yet be done automatically. If you are lucky, you will
>find "blocks" of conserved segments with very few gaps, separated by regions
>that are totally ambiguous. These ambiguous pieces must be removed. Some
>parts of homologous proteins are simply unalignable from primary sequence
>information alone. You can guess at the alignment in these difficult parts
>using an "algorithm" but the guess may not mean anything biologically.
>If you use these badly guessed-at pieces, then the tree topology may only
>depend on how the guess was made.
>
>A further problem is how to treat gaps (insertions and deletions). I have seen
>many cases where people include gaps in difficult alignments and score them as
>characters (for parsimony or distances). You may end up with the effect of
>the gaps completely outweighing the aligned residues, in determining the
>topology of the final tree. If the alignment was derived manually, then, in
>effect, you are also manufacturing the tree topology manually. One drastic
>but clean solution is to remove all sites where any sequence has a gap. This
>may throw away half your data though.
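To illustrate the "drastic but clean" option described just above, here is a
minimal Python sketch. It is not from Des' message. It assumes the alignment is
held as a dict of equal-length strings and that "-" or "." marks a gap; the
function name strip_gap_columns and the toy sequences are hypothetical.

```python
def strip_gap_columns(alignment, gap_chars="-."):
    """Return a copy of the alignment keeping only gap-free columns.

    alignment: dict mapping sequence name -> aligned sequence string.
    gap_chars: characters treated as gaps (an assumption; adjust as needed).
    """
    seqs = list(alignment.values())
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    length = len(seqs[0]) if seqs else 0

    # Keep a column only if no sequence has a gap character there.
    keep = [i for i in range(length)
            if not any(s[i] in gap_chars for s in seqs)]

    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


# Toy alignment (made-up sequences, for illustration only).
toy = {
    "seqA": "MK-LVAG",
    "seqB": "MKTLV-G",
    "seqC": "MRTLVAG",
}
stripped = strip_gap_columns(toy)
kept_fraction = len(stripped["seqA"]) / len(toy["seqA"])
```

Comparing the stripped length with the original length (kept_fraction here)
shows how much data this step costs before you commit to it.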
>
>If you do manage to generate a multiple alignment with enough conserved blocks
>and remove the nasty bits, actually generating a topology is the easy part.
>(e.g. using bits of PHYLIP or PAUP). Neighbor-Joining trees from distances are
>fast and you can bootstrap them easily but beware that you cannot use the
>usual corrections for "multiple hits" on the distances if any of the
>sequence pairs are less than about 18% identical (over the aligned regions).
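One way to act on this warning is to check pairwise identities before applying
any distance correction. The Python sketch below is not from Des' message. It
assumes identity is counted only over columns where neither sequence has a gap
(definitions of percent identity vary), and it takes the 18% figure from the
text above as the default threshold. The function names are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b, gap_chars="-."):
    """Percent identity over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b)
             if x not in gap_chars and y not in gap_chars]
    if not pairs:
        return 0.0  # no comparable columns at all
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def low_identity_pairs(alignment, threshold=18.0):
    """List (name1, name2, identity) for pairs below the threshold."""
    flagged = []
    for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
        pid = percent_identity(s1, s2)
        if pid < threshold:
            flagged.append((n1, n2, pid))
    return flagged


# Toy alignment (made-up sequences, for illustration only).
toy = {
    "seqA": "ACDEFGHIKL",
    "seqB": "ACDEFGHIKV",
    "seqC": "AWYTSRQPNM",
}
flagged = low_identity_pairs(toy)
```

Any pair that ends up in flagged is one where, following the advice above, a
multiple-hit correction should not be trusted.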
>
>Des Higgins
>EMBL, Heidelberg, Germany.
P.S. My apologies for taking up bandwidth with a repeated message; however, I
feel that Des' message is definitely worth repeating. SMT