Concerning the Determination of Homologous Sequences,
David Maddison wrote:
<"This collection of sequences produce proteins that all have the
<same function, but the sequence similarities of some of them
<to the others are very low; there are conserved residues that
<are present in some, but not all of the sequences."
<"Another way to put it is that I would like to know if this
<collection of sequences is monophyletic on the grand tree of
<all gene trees, or if it is para/polyphyletic with the intervening
<sequences being of different function. While this is
<fundamentally a phylogenetic question (presuming such a
<grand tree of all gene trees exists), it is a horrendous one
<to answer in that the sequences are so divergent that
<one can't do a normal phylogenetic analysis on them - there's
<just nothing to get a hold of."
It sounds to me like what you are really asking is whether these gene
sequences are homologous. Lewin (1987 Science 237:1570) discussed this
in terms of semantics, and many other authors have
addressed the assessment of "homology" in molecular sequence data (e.g.
Patterson, 1988 Mol. Biol. Evol. 5(6):603-625). However, if you are
trying to build a tree from sequences for which you lack sufficient
evidence to support inferred homology, then
I doubt you'll uncover much. I recall reading a paper in Cladistics
a couple of years ago in which the author randomly added nucleotide sequences
to the data matrix of their "True Tree" until the "true" topology could
no longer be recovered, thus approximating the saturation of change in a
data set such as the one you describe. Although it is interesting to
know how crummy your data set can theoretically be and still give you
the right answers, I would suggest that this problem is not unique to
molecular sequence data, and that systematists looking at any type of
character must still justify "inferred homology." If you are truly
interested in knowing whether these genes are "monophyletic on the
grand tree," I suggest that the only rational way of finding out is to
plot them on an existing phylogeny and see whether they are
congruent with the patterns of speciation. I think you could do this by
treating some of your larger conserved regions as single characters and
then looking at how all of the regions fall out on the "True Tree."
Obviously, if there is no existing tree, you can't do this. Also, once
you have done this, you could never use these gene sequences to construct
a taxonomic phylogeny, since their homology would have been judged
against that very tree.
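To make the mapping idea concrete, here is a minimal sketch, not a worked-out method. It assumes each conserved region is scored as a presence/absence character and that the reference tree is strictly bifurcating. The tree, the taxon names, the two regions, and the function name `fitch_changes` are all made up for illustration. It uses Fitch parsimony to count the fewest state changes each character needs on the fixed tree: a region that needs only one change is congruent with the tree, while one that needs several is not.

```python
def fitch_changes(tree, states):
    """Minimum number of state changes for one character on a fixed binary tree.

    tree   -- nested 2-tuples; leaves are taxon names (strings)
    states -- dict mapping each taxon name to its character state
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):              # leaf: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                             # children agree: no change needed
            return common
        changes += 1                           # children disagree: one change
        return left | right

    visit(tree)
    return changes


# Hypothetical reference tree ("True Tree") for six made-up taxa.
true_tree = ((("A", "B"), "C"), (("D", "E"), "F"))

# Presence (1) / absence (0) of two hypothetical conserved regions.
region_1 = {"A": 1, "B": 1, "C": 1, "D": 0, "E": 0, "F": 0}
region_2 = {"A": 1, "B": 0, "C": 0, "D": 1, "E": 0, "F": 1}

for name, region in [("region_1", region_1), ("region_2", region_2)]:
    print(name, "needs", fitch_changes(true_tree, region), "change(s)")
```

Here region_1 fits the tree with a single change, whereas region_2 needs three, which is the kind of scatter you would expect if its similarity were not due to common descent.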
I have looked into this problem from the "chaos" and "fractal
analysis" point of view, asking the question, "Even though the data are
saturated with change, is there still an informative signal present?" I
have found a few papers that address this topic in passing, and
I could send you references if this is indeed the direction in which you
are heading.
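As a toy illustration of what saturation does to a signal, the sketch below is loosely modeled on the noise-addition experiment I mentioned above. It is not a reconstruction of that paper's method. It overwrites a growing fraction of sites in a made-up sequence with nucleotides drawn uniformly at random and reports the proportion of differing sites (the p-distance) from the original. The sequence, the seed, and the function names are my own assumptions. With four equally likely bases, a fully randomized sequence is expected to differ from the original at about three quarters of its sites, so the distance flattens out near 0.75.

```python
import random

NUCLEOTIDES = "ACGT"


def add_noise(seq, fraction, rng):
    """Overwrite the given fraction of sites with uniformly random nucleotides."""
    sites = list(seq)
    n_noisy = round(fraction * len(sites))
    for i in rng.sample(range(len(sites)), n_noisy):
        sites[i] = rng.choice(NUCLEOTIDES)     # may redraw the original base
    return "".join(sites)


def p_distance(a, b):
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)                         # fixed seed so runs are repeatable
original = "".join(rng.choice(NUCLEOTIDES) for _ in range(10_000))  # made-up data

for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    noisy = add_noise(original, fraction, rng)
    print(f"{fraction:.2f} noise -> p-distance {p_distance(original, noisy):.3f}")
```

Once the distance between two sequences sits at that plateau, it can no longer be told apart from the distance between unrelated sequences, which is the situation in which there is "nothing to get a hold of."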
Byron Adams
University of Nebraska
bjadams at crcvms.unl.edu