Databases of less than N% similar proteins (or portable Smith Waterman)
michaelw at cs.su.oz.au
Tue Jul 30 20:10:40 EST 1996
For a project I currently have underway, I require
a protein database which is not only non-redundant in
the sense (say) of OWL, but also in which any pair of proteins
is no more than N% similar (say 75%). The only thing I can
think of is HSSP from EBI, but this is a small database of
sequences with known structure (i.e. subset of PDB).
Does anyone have such a database?
In lieu of such a database being already available:
Generating such a database is not pretty. Just scoring
all possible pairs is O(n^2), which for ~200,000 proteins
is a large number of comparisons, and that is before you attempt to find the
largest such database by removing nodes (sequences) until all
edges (matches) between the remaining nodes are less than N%
(NP-complete, I'm pretty sure).
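
As a rough illustration of the selection step described above, here is a
minimal sketch of a first-fit heuristic: a sequence is kept only if it is
no more than N% similar to every sequence already kept. This is an
assumption made for illustration, not necessarily the algorithm referred to
in this post, and it does not find the largest possible database. The names
similarity_fn, positional_identity and greedy_filter are hypothetical, and
positional_identity is only a crude placeholder for a real alignment-based
%match function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: similarity of two sequences in percent (0..100). */
typedef double (*similarity_fn)(const char *a, const char *b);

/* Placeholder similarity, for demonstration only: percentage of positions
 * holding the same character, relative to the longer sequence. A real run
 * would plug in an alignment-based measure instead. */
double positional_identity(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t longer = la > lb ? la : lb;
    size_t shorter = la < lb ? la : lb;
    size_t same = 0, i;

    if (longer == 0)
        return 100.0;               /* two empty strings: treat as identical */
    for (i = 0; i < shorter; i++)
        if (a[i] == b[i])
            same++;
    return 100.0 * (double)same / (double)longer;
}

/* First-fit greedy filter: keep seqs[i] only if it is at most `threshold`
 * percent similar to every sequence kept so far. Indices of the kept
 * sequences are written to `kept` (capacity n); the count is returned.
 * The worst case is still n*(n-1)/2 comparisons. */
size_t greedy_filter(const char *const *seqs, size_t n, double threshold,
                     similarity_fn sim, size_t *kept)
{
    size_t nkept = 0, i, j;

    assert(seqs != NULL && sim != NULL && kept != NULL);
    for (i = 0; i < n; i++) {
        int ok = 1;
        for (j = 0; j < nkept; j++) {
            if (sim(seqs[i], seqs[kept[j]]) > threshold) {
                ok = 0;             /* too close to something already kept */
                break;
            }
        }
        if (ok)
            kept[nkept++] = i;
    }
    return nkept;
}
```

The result depends on the order in which sequences are visited, so sorting
the input first (for example, longest sequences first) is one design choice
worth experimenting with.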
I think I have a greedy algorithm that will generate
a database of the sort required, but for this I need
a way of generating %match values. Which brings me
to the alternate request: does someone have a free-standing
Smith-Waterman implementation that can be readily
integrated into a program to generate an N%-similar
database? Ideally what I'm looking for is a C source
module that, given two char* strings or file pointers,
returns a floating-point value.
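
To make the kind of interface being asked for concrete, here is a minimal
sketch of such a module: Smith-Waterman local alignment over two char*
strings, returning the percent identity of the best local alignment. The
function name sw_percent_identity and the scoring values (match +2,
mismatch -1, linear gap -2) are assumptions chosen for illustration; this
is not ssearch, and it uses no PAM or other substitution matrix.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative scoring parameters (assumptions, not PAM values). */
#define SW_MATCH      2
#define SW_MISMATCH (-1)
#define SW_GAP      (-2)

/* Returns the percent identity (0..100) over the best-scoring local
 * alignment of a and b, or -1.0 if memory allocation fails.
 * Uses a full (strlen(a)+1) x (strlen(b)+1) score matrix. */
double sw_percent_identity(const char *a, const char *b)
{
    size_t la, lb, cols, i, j, bi = 0, bj = 0;
    size_t identical = 0, columns = 0;
    int *h, best = 0;

    assert(a != NULL && b != NULL);
    la = strlen(a);
    lb = strlen(b);
    cols = lb + 1;

    h = calloc((la + 1) * cols, sizeof *h);   /* row 0 and column 0 stay 0 */
    if (h == NULL)
        return -1.0;

    /* Fill the matrix, remembering the highest-scoring cell. */
    for (i = 1; i <= la; i++) {
        for (j = 1; j <= lb; j++) {
            int s = (a[i - 1] == b[j - 1]) ? SW_MATCH : SW_MISMATCH;
            int diag = h[(i - 1) * cols + (j - 1)] + s;
            int up = h[(i - 1) * cols + j] + SW_GAP;
            int left = h[i * cols + (j - 1)] + SW_GAP;
            int v = 0;                         /* local alignment floor */
            if (diag > v) v = diag;
            if (up > v) v = up;
            if (left > v) v = left;
            h[i * cols + j] = v;
            if (v > best) { best = v; bi = i; bj = j; }
        }
    }

    /* Trace back from the best cell until a zero cell is reached,
     * counting alignment columns and identical residue pairs. */
    i = bi;
    j = bj;
    while (i > 0 && j > 0 && h[i * cols + j] > 0) {
        int v = h[i * cols + j];
        int s = (a[i - 1] == b[j - 1]) ? SW_MATCH : SW_MISMATCH;
        if (v == h[(i - 1) * cols + (j - 1)] + s) {
            if (a[i - 1] == b[j - 1])
                identical++;
            i--;
            j--;
        } else if (v == h[(i - 1) * cols + j] + SW_GAP) {
            i--;                               /* gap in b */
        } else {
            j--;                               /* gap in a */
        }
        columns++;
    }

    free(h);
    return columns ? 100.0 * (double)identical / (double)columns : 0.0;
}
```

Because the alignment is local, a short exact match inside two otherwise
unrelated sequences scores 100%. Dividing the identity count by the length
of the shorter sequence rather than by the number of alignment columns is
one alternative, depending on what "N% similar" should mean.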
(I had a look at ssearch in the FASTA distribution, but
1) extracting it from its context looks difficult, and 2)
I have no idea how to scale PAM-based scores to provide the
desired %match values.)
Pointers much appreciated. Please email me directly.
Dr Michael J. Wise
Basser Department of Computer Science, F09
Sydney University, N.S.W. 2006
Telephone: (+61 2) (02) 9351 4156
FAX: (+61 2) (02) 9351 3838
Internet: michaelw at cs.su.oz.au
"We have met the enemy, and they is us"
- Pogo (Walt Kelly), 1970