Databases of less than N% similar proteins (or portable Smith Waterman)
Steven Brenner
brenner at mole.bio.cam.ac.uk
Wed Jul 31 12:36:50 EST 1996
Comments interspersed....
michaelw at cs.su.oz.au (Michael Wise) writes:
>For a project I currently have underway, I require
>a protein database which is not only non-redundant in
>the sense (say) of OWL, but also in which any pair of proteins
>is no more than N% similar (say 75%). The only thing I can
>think of is HSSP from EBI, but this is a small database of
>sequences with known structure (i.e. subset of PDB).
>To generate such a database is not pretty (just scoring
>all possible matches - O(n^2) for ~200,000 proteins
>is a large number, before you try to attempt to find the
>largest such database by removing the nodes/sequences
>edges/matches between remaining nodes are less than N%
>(NP-complete, I'm pretty sure)
Finding the largest such database is indeed NP-complete. There are a
variety of greedy and non-greedy approaches which can come up with
some approximation. (Finding the largest set is analagous to vertex
cover.)
>I think I have a greedy algorithm that will generate
>A database of the sort required, but for this I need
>a way of generating %match values. Which brings me
>to the alternate request: does someone have a free-standing
>Smith Waterman implementation that I can be readily
>integrated into a program to generate a N% Similar
>database. Ideally what I'm looking for is a C source
>module that, given two char* strings or file pointers,
>returns a floating point real.
You do not want to use Smith-Waterman for this. That would take on
the order of 200 CPU-years. Since you're only looking to exlude
relatives with extremely high (e.g. >75%) sequence identity, you can
use BLAST with special parameters to make it less sensitive (and hence
faster). That way, it may take about 2 CPU-years on a fast machine.
You could then postprocess the BLAST output to get the %id.
On the other hand, you should bear in mind that if you use pure
%identity, you'll be throwing away most of the information in the
sequences. P-values (and to a lesser extent bits / SW-scores) are
much, much more informative.
--
Steven E. Brenner | S.E.Brenner at bioc.cam.ac.uk
MRC Laboratory of Molecular Biology |
Hills Road | Office: +44 1223 248011
Cambridge CB2 2QH, UK | Fax: +44 1223 213556
More information about the Bio-soft
mailing list