bob.gross at dartmouth.edu (Bob Gross) writes:
Greetings... Your query actually seems to split into several rather
unrelated sections, so I'll address each individually...
>My colleagues in Math and Computer Science and I are developing a new grouping
>algorithm that allows us to take large numbers of sequences and assign them to
>individual groups. All the sequences in each group are related to each other.
>What we want to do is take peptide sequences that are translated from single
>exons and try to group them.
So, you want to cluster a large number of sequences into homologous
groups. This is an interesting problem, and there's recently been a
lot of work on it.
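To make the clustering problem concrete, here is a minimal sketch of
one simple approach, single-linkage grouping: any two sequences whose
pairwise score passes a threshold are placed in the same group. This
is not your algorithm, which I haven't seen. The function names, the
k-mer Jaccard score (a toy stand-in for a real alignment score) and
the threshold are all assumptions made for illustration.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Toy score: Jaccard index of the two k-mer sets.

    Hypothetical stand-in for a real alignment-based score.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def single_linkage_groups(seqs, threshold=0.3, k=3):
    """Group sequences so that any pair scoring >= threshold shares a group."""
    parent = list(range(len(seqs)))  # union-find forest over sequence indices

    def find(i):
        # Walk to the root, shortening the path as we go (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once and merge the groups of similar pairs.
    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j], k) >= threshold:
            parent[find(i)] = find(j)

    # Collect the members of each group under its root index.
    groups = {}
    for i, s in enumerate(seqs):
        groups.setdefault(find(i), []).append(s)
    return sorted(groups.values())
```

Note that single linkage is transitive: if A matches B and B matches
C, all three land in one group even when A and C share nothing. With
multi-domain proteins this can chain unrelated sequences together, so
the choice of unit being clustered matters a great deal.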
>This would allow us to study the possible relationships among protein domains
>that are coded for on individual exons, which might be descendants of the same
>primordial exon. Such information might shed some light on evolutionary
>processes and might help in understanding the properties of newly sequenced
>genes. Ultimately it might provide a database of exons that share common
I'm not sure why here (and above) you're focusing on exons. Whether
exons are ancient or not is a matter of much debate. I haven't
followed very recent developments, but you should have a look at
Stoltzfus et al. for a strong argument that introns are
(comparatively) recent. In any event, all parties seem to agree that
modern exons don't trivially correlate with any ancient ones.
...which leads to your next question:
>Of course we are aware of databases like the Prosite database that contain many
>motifs, but these motifs are usually quite short and probably do not represent
>whole functional domains on proteins. Rather, they often represent short
>targets, e.g. glycosylation sites, phosphorylation sites, etc. However, there
>are some true "domains" such as ATP binding, G-protein GTP binding, DNA
>binding, etc. My question to this group is what "domains" would you start off
>with in testing the grouping algorithm - based on your biological knowledge?
>The three above are starting points. What is of interest are those domains that
>are coded for on single exons and that share a common function even though the
>exons might reside on different genes.
If I understand you, the question is what type of domain you should
start with. I think what you should really focus on are individual
structural domains.
Structural domains of proteins are clearly 'modules' of evolution, and
any given domain may have distant homologs in otherwise unrelated
proteins. Hence the unit of evolution is frequently no larger than
the protein structural domain. Furthermore, we have seen very few
cases where regions of proteins smaller than a structural domain
(including those which are exon-sized) are homologous in otherwise
unrelated proteins.
Protein structures have two crucial advantages over sequences. First:
you can look further back in history with them. In some cases,
similar protein structures may indicate a very distant evolutionary
relationship between two proteins with insignificant sequence
similarity. Second: if proteins are of different structure, you can
say with excellent confidence that they are unrelated. By contrast,
sequences which are dissimilar may have simply diverged too far to be
recognizable. So, structures allow you to identify distant
relationships and to reject relationships as well. Sequences can't.
To look at structural domains, I would suggest you have a look at the
scop: Structural Classification of Proteins database, which
hierarchically organizes all proteins of known structure according to
their structural and evolutionary relationships. You can access the
Good luck with your work!
Steven E. Brenner | S.E.Brenner at bioc.cam.ac.uk
MRC Laboratory of Molecular Biology |
Hills Road | Office: +44 1223 248011
Cambridge CB2 2QH, UK | Fax: +44 1223 213556