Protein Sequence - Combinatoric Table

Edwin Wright ewright at fox.nstn.ns.ca
Sat Jan 22 22:28:17 EST 1994


In Message Sat, 15 Jan 1994 17:29:03 GMT,
  esr at al.* (Sonnhammer E./Durbin) writes:

>
>"Edwin Wright" writes:
>
>   Does anyone know of any research underway to produce some kind of table or
>   universal combinatoric sequence structure for all proteins, in an effort to
>   delimit the possible combinations of protein sequences?
>
>
>Edwin, 
>
>I think you would get more answers if you more explicitly say what you
>mean by your question.  Are you referring to the combinatorial nature
>of protein domains in mosaic proteins, or something else?  What would
>the table you're asking for look like and what do you mean by
>"universal combinatoric sequence structure for all proteins"?
>   Also, what kind of questions would you like to answer - something
>like "given protein domain A, which other domains has it been observed
>to be fused to?", or what?
>
>I have been involved in a project where the goal was to cluster a
>protein sequence database into families of domains, taking the
>combinatorial nature of proteins into account.  If this is any help to
>you, you can pick up the resulting clustered database (as multiple
>sequence alignments) and a preprint of the paper by anon. FTP at
>cele.mrc-lmb.cam.ac.uk in /pub/prodom.
>
>Erik Sonnhammer
>Sanger Centre
>Cambridge UK


Erik,

Thank you for replying to the subject query.  I shall endeavour to
articulate this query more explicitly:

First of all, as you well know, from a purely mathematical perspective, the
number of combinations of protein sequences (given 20 amino acids) is
enormous; e.g., for a protein sequence of, say, 100 residues, the total
number of hypothetical sequences is 20^100 (or approximately 10^130).
Clearly, the total number of different protein sequences in the biosphere
(millions or possibly tens of millions, i.e. 10^6 - 10^7) is just a minute
fraction of this 10^130.  Presumably, the total number of possible
biologically viable protein sequences is much closer to 10^7 than to 10^130;
hence, it should be theoretically possible to delimit this smaller (i.e.
10^7) number.
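
As a rough check of the arithmetic above, here is a minimal Python sketch that
works in base-10 logarithms.  The function name log10_sequence_space is made
up for this illustration; the figures 20, 100 and 10^7 are the ones quoted in
the paragraph above.

```python
import math

def log10_sequence_space(length, alphabet_size=20):
    """Return log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet_size)

# Size of the hypothetical sequence space for a 100-residue protein.
exponent = log10_sequence_space(100)
print(f"20^100 is about 10^{exponent:.1f}")

# Share of that space covered by 10^7 observed sequences, as a power of ten.
print(f"10^7 / 20^100 is about 10^{7 - exponent:.1f}")
```

The exponent comes out at roughly 130, so 10^7 sequences would cover on the
order of 10^-123 of the 100-residue space.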

Given the average protein sequence length (10^2 - 10^3 residues), it should
be possible to construct a hierarchical table or hierarchical "universal
combinatoric sequence structure", i.e. a single generic (but
hierarchically and combinatorially structured) protein sequence which could
account for the approximately 10^7 different protein sequences in the
biosphere.

Such a generic sequence would presumably include all known motifs (cf.
PROSITE Dictionary); a hierarchical numbering system, e.g.

                      1
                      1.1
                      1.1.1
                      2
                      2.1
                      etc.

would be imposed on both motif and non-motif segments of the sequence - in
essence, this generic sequence would be one "supermotif", if you will.
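
One way to handle dotted labels of this kind in a program is to treat each
label as a tuple of integers.  The sketch below is only an illustration under
that assumption; the helper names label_key and parent are hypothetical and
not part of any existing tool.

```python
def label_key(label):
    """Turn a dotted label such as '125.1.2.1' into a tuple of ints for sorting."""
    return tuple(int(part) for part in label.split("."))

def parent(label):
    """Return the label one level up, or None for a top-level label."""
    head, sep, _ = label.rpartition(".")
    return head if sep else None

labels = ["2.1", "1.1.1", "2", "1", "1.1"]
ordered = sorted(labels, key=label_key)
print(ordered)            # ['1', '1.1', '1.1.1', '2', '2.1']
print(parent("1.1.1"))    # 1.1
```

Sorting on integer tuples rather than on the raw strings keeps, for example,
position 50 ahead of position 125, which a plain string sort would not.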

As an example, the following hypothetical sequence segment:

                                       5
                            31 F M P F W

might be hierarchically written (albeit roughly) as:

     1          6        50         125         409         743      END
     .......... F ......  - .......  -  .......  -  ........ W  .......
                         50.1       125.1       409.1
                          M         125.1.1     409.1.1
                                    125.1.2.1   409.1.2
                                    125.2        F
                                    125.2.1
                                     P

A very crude analogy would be the tRNA nucleotide sequence (76 - 95
nucleotides), in which the primary nucleotides are (absolutely) numbered 1
to 76.  Any of the remaining 19 nucleotides that are present (depending upon
the specific tRNA sequence) are numbered at a lower (hierarchical) level
than the primary 76 nucleotides.  In the case of the D-loop, the numbering
is ...17 and/or 17A, 18, 19, 20 and/or 20A and/or 20B, 21,...etc.; in the
case of the variable loop, the numbering is ...47 and/or 47A and/or...and/or
47P, 48,...etc.  Just as the tRNA sequence "motif" includes the allowed
nucleotides for each of the 95 possible positions in addition to the
hierarchical structure, the generic protein sequence "supermotif" would
similarly include the allowed amino acids for each of its 1,...1.1,...1.1.2,
...2,...etc. (i.e. hierarchical) positions.
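
To make the idea of "allowed residues per hierarchical position" concrete,
here is a small Python sketch.  Everything in it is an assumption made for
illustration: the three-position toy "supermotif", its allowed residue sets,
the rule that a position may be flagged as optional (by analogy with the
"and/or" tRNA positions), and the function name matches.  It is not derived
from real sequence data.

```python
def matches(sequence, positions):
    """Check a sequence against (label, allowed_residues, optional) positions.

    Each position consumes one residue from its allowed set; a position
    flagged optional may also be skipped.  The whole sequence must be used.
    """
    reachable = {0}  # offsets into the sequence consumed so far
    for _label, allowed, optional in positions:
        next_reachable = set()
        for i in reachable:
            if optional:
                next_reachable.add(i)          # skip this position
            if i < len(sequence) and sequence[i] in allowed:
                next_reachable.add(i + 1)      # consume one residue
        reachable = next_reachable
        if not reachable:
            return False
    return len(sequence) in reachable

# Hypothetical toy supermotif: position 1.1 is an optional insertion.
toy_supermotif = [
    ("1",   set("FM"), False),
    ("1.1", set("P"),  True),
    ("2",   set("W"),  False),
]

print(matches("FW", toy_supermotif))    # True
print(matches("MPW", toy_supermotif))   # True
print(matches("FAW", toy_supermotif))   # False
```

Tracking the set of reachable offsets avoids enumerating every combination of
present and absent optional positions explicitly.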

NOTE: The hierarchical protein sequence structure may or may not relate to
any phylogenetic hierarchies, but would essentially reflect a "generically
inherent hierarchy" in the biosynthesis of proteins.

Given the 10^4 - 10^5 different protein sequences currently available in
databases, there may be sufficient data to establish the basic structure of
a generic protein sequence, with new data either corroborating the generic
combinatorial and hierarchical structure or necessitating revisions.

I suspect that delimitation of the combinatorial and hierarchical structure
of biologically viable protein sequences would provide fundamental insight
into the essential nature of proteins.

I would greatly appreciate hearing about any research even remotely related
to this concept.

Your own research "to cluster a protein sequence database into families of
domains, taking the combinatorial nature of proteins into account" sounds
interesting; I'll try to get your preprint through anonymous FTP.

Regards,
Edwin Wright
85 Spinnaker Drive, A-602
Halifax, Nova Scotia
CANADA B3N 3E3

Telephone: (902) 477-5037

Email: ewright at fox.nstn.ns.ca
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


