the concept of a protein family

Andrew Dalke dalke at
Mon Dec 14 10:04:31 EST 1998

Bejerano Gill <jill at> asked:
> Is the concept "Protein family" really so undefined as some of the lit.
> claims it to be?
>    1) does the term refer to a group of homologous proteins?
>       if so - how far back should their common ancestor be? is that
>       measured in time, in sequence similarity or perhaps in function
>       similarity?

  Yes to the first one, but there's no fixed definition of how far
back the common ancestor may be.  For that matter, homology is inferred
using different metrics, for both sequence similarity and structural
similarity.

  Time might not be the best way to measure similarity.  Some
proteins are extremely well conserved through time while others
are much more variable.  Most phylogeny plots have time measured
in arbitrary units, though there are some where they've made an
estimate based on the likely mutation rate.

>    2) or does it refer to functionally analogous proteins? how different
>       is this concept from the previous? Is convergent evolution the only
>       reason to blame? how common is it?

  No, it doesn't.  There are proteins that are functionally similar
but very different in sequence and structure.  A standard example
is chymotrypsin and subtilisin.  They have different sequences and
different folds, but the arrangement around the catalytic triad is the
same and they have the same function.
  There are some classifications based on function, such as the Enzyme
Classification (E.C.).

>    3) Or is it simply that every database proprietor (eg PROSITE, Pfam
>     etc.) simply devises, directly or indirectly, his own "taxonomy"
>     into groups?

  When is someone "tall"?  When did a reptile become a bird? Is there
an ideal essence of a chair?  Does a dog have Zen natur-- oops, sorry,
got carried away there :)

  I think the answer is that there is broad consensus, but the edges
differ from person to person.

>    4) Is it therefore for one to deduce that no real benchmark or
>     criterion for comparing any two such taxonomies exists, except
>     one's personal taste?

  See previous :)  But it turns out the various definitions are useful;
for example, they have some explanatory and predictive power. 
Sequences that are very similar almost always have similar function.
The problem is extending that to weaker and weaker similarities.
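
  To make "very similar" concrete, here is a minimal sketch of percent
identity between two sequences that have already been aligned.  The
function name, the "-" gap character, and the choice to ignore gapped
columns in the denominator are all assumptions for illustration; other
definitions of percent identity use a different denominator and give
different numbers.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped alignment columns where both residues agree.

    Assumes both strings come from the same pairwise alignment, so they
    have equal length. Columns with a gap in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


print(percent_identity("ACGT", "ACGA"))
print(percent_identity("AC-T", "ACGT"))
```

  Any cutoff applied to such a number is a judgment call, which is
where the weaker similarities become hard to interpret.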

>    5) Or perhaps all these methods should be compared only in terms of
>     the use researchers find in them - say, when a new protein sequence
>     is searched against them?

  A pragmatic definition, but hopefully there is some agreement on
when to use FASTA or BLAST or Smith-Waterman, or why certain
values of pairwise similarity or gap penalty shouldn't be used.
In other words, you can't try everything and there are reasons why
it's okay to ignore some possibilities.
  As with many fields, you build up knowledge on what is useful
and what is not based on your experience and that of others.
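
  As an illustration of why the gap penalty matters, here is a minimal
sketch of Smith-Waterman local alignment scoring.  It is not how any
particular search program implements it: the match, mismatch, and
linear gap values are arbitrary assumptions, whereas real protein
searches normally use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty.

    The scoring values are illustrative assumptions, not recommended
    settings. Only the score is returned, not the alignment itself.
    """
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# The same pair of sequences scored with a mild and a harsh gap penalty.
print(smith_waterman_score("ACGTACGT", "ACGTTACGT", gap=-2))
print(smith_waterman_score("ACGTACGT", "ACGTTACGT", gap=-10))
```

  With the mild penalty the best local alignment spans the inserted
base; with the harsh one it falls back to a shorter ungapped match, so
the choice of penalty changes what counts as "similar".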

>    6) What is exactly the picture there? It seems the wealth of
>     partial tools as well as DBs is rather overwhelming...

  That means there's a lot of data and we don't know all the answers.

