Databases of less than N% similar proteins (or portable Smith-Waterman)

Michael Wise michaelw at
Tue Jul 30 20:10:40 EST 1996

For a project I currently have underway, I require
a protein database which is not only non-redundant in
the sense (say) of OWL, but also in which any pair of proteins
is no more than N% similar (say 75%). The only thing I can
think of is HSSP from EBI, but this is a small database of
sequences with known structure (i.e. a subset of PDB).

Does anyone have such a database?

In lieu of such a database being already available:

Generating such a database is not pretty. Just scoring
all possible matches is O(n^2), which for ~200,000 proteins
is a large number, and that is before you attempt to find the
largest such database by removing nodes/sequences until all
edges/matches between the remaining nodes are less than N%
(NP-complete, I'm pretty sure).
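
To put a number on the pairwise cost: n(n-1)/2 for n = 200,000 is
19,999,900,000, i.e. roughly 2 x 10^10 comparisons. A simple way to avoid
searching for the largest possible set is a greedy pass that keeps a
sequence only if it is below the threshold against everything kept so far.
What follows is a minimal sketch of that general idea in C, not the
poster's own algorithm. The names greedy_filter, similarity_fn and
naive_percent_identity are hypothetical, and the placeholder scorer is a
position-by-position comparison used only to make the sketch self-contained;
a real aligner would replace it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scoring callback: percent similarity (0..100) of two
 * NUL-terminated sequences. Any pairwise aligner could sit behind it. */
typedef double (*similarity_fn)(const char *a, const char *b);

/* Placeholder scorer, for illustration only: percentage of identical
 * residues at equal positions, relative to the longer sequence.
 * This is NOT an alignment. */
double naive_percent_identity(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t la = strlen(a), lb = strlen(b);
    size_t longer = la > lb ? la : lb;
    size_t shorter = la < lb ? la : lb;
    size_t same = 0;
    if (longer == 0)
        return 100.0;
    for (size_t i = 0; i < shorter; i++)
        if (a[i] == b[i])
            same++;
    return 100.0 * (double)same / (double)longer;
}

/* Greedy filter: walk the sequences in input order and keep one only if
 * it is no more than max_pct similar to every sequence kept so far.
 * Indices of kept sequences are written to kept[] (capacity n).
 * Returns the number of sequences kept. */
size_t greedy_filter(const char *const *seqs, size_t n, double max_pct,
                     similarity_fn score, size_t *kept)
{
    size_t nkept = 0;
    for (size_t i = 0; i < n; i++) {
        int ok = 1;
        for (size_t k = 0; k < nkept; k++) {
            if (score(seqs[i], seqs[kept[k]]) > max_pct) {
                ok = 0;   /* too close to something already kept */
                break;
            }
        }
        if (ok)
            kept[nkept++] = i;
    }
    return nkept;
}
```

The result depends on input order and is not guaranteed to be the largest
possible set, only one in which no kept pair exceeds the threshold. The
worst case is still quadratic, although the early break helps when many
sequences are rejected quickly.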

I think I have a greedy algorithm that will generate
a database of the sort required, but for this I need
a way of generating %match values. Which brings me
to the alternate request: does someone have a free-standing
Smith-Waterman implementation that can be readily
integrated into a program to generate an N%-similar
database? Ideally what I'm looking for is a C source
module that, given two char* strings or file pointers,
returns a floating-point number.

(I had a look at ssearch in the FASTA distribution, but
1) extracting it from its context looks difficult, and 2)
I have no idea how to scale PAM-based scores to provide the
desired %match values.)
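
For concreteness, the kind of module being asked for might look like the
sketch below: a function taking two char* strings and returning a
percentage. It is an illustration under stated assumptions, not ssearch and
not a recommended scoring scheme. It uses a toy match/mismatch score with a
linear gap penalty instead of a PAM matrix, and it sidesteps the score
scaling question by reporting percent identity, i.e. identical columns
divided by the length of the best local alignment. The name
sw_percent_identity and the constants MATCH, MISMATCH and GAP are
hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed toy scoring scheme (not PAM or BLOSUM), linear gap penalty. */
enum { MATCH = 2, MISMATCH = -1, GAP = -2 };

static int max2(int x, int y) { return x > y ? x : y; }

/* Smith-Waterman local alignment of a and b. Returns percent identity
 * (0..100) over the best-scoring local alignment, 0.0 if nothing aligns,
 * or -1.0 if memory allocation fails. */
double sw_percent_identity(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t n = strlen(a), m = strlen(b);
    size_t w = m + 1;                          /* row width of H */
    int *H = calloc((n + 1) * w, sizeof *H);   /* row 0, column 0 stay 0 */
    if (H == NULL)
        return -1.0;

    /* Fill the score matrix and remember the best cell. */
    int best = 0;
    size_t bi = 0, bj = 0;
    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
            int h = max2(0, H[(i - 1) * w + (j - 1)] + s);
            h = max2(h, H[(i - 1) * w + j] + GAP);
            h = max2(h, H[i * w + (j - 1)] + GAP);
            H[i * w + j] = h;
            if (h > best) { best = h; bi = i; bj = j; }
        }
    }

    /* Trace back from the best cell until a zero cell is reached,
     * counting alignment columns and identical residue pairs. */
    size_t i = bi, j = bj, columns = 0, identical = 0;
    while (i > 0 && j > 0 && H[i * w + j] > 0) {
        int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
        if (H[i * w + j] == H[(i - 1) * w + (j - 1)] + s) {
            if (a[i - 1] == b[j - 1])
                identical++;
            i--; j--;                          /* aligned pair */
        } else if (H[i * w + j] == H[(i - 1) * w + j] + GAP) {
            i--;                               /* gap in b */
        } else {
            j--;                               /* gap in a */
        }
        columns++;
    }
    free(H);
    return columns ? 100.0 * (double)identical / (double)columns : 0.0;
}
```

Note that the full matrix costs (n+1)(m+1) ints per comparison, and that
percent identity over a local alignment can be high for a short matching
region, so a minimum alignment length may also be worth checking.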

Pointers much appreciated. Please email me directly.

Many Thanks

  Dr Michael J. Wise
  Basser Department of Computer Science, F09
  Sydney University, N.S.W. 2006
  Telephone:	(+61 2) (02) 9351 4156
  FAX:		(+61 2) (02) 9351 3838
  Internet:  	michaelw at
  "We have met the enemy, and they is us"
  	- Pogo (Walt Kelly), 1970
