Databases of less than N% similar proteins (or portable Smith Waterman)

Steven Brenner brenner at mole.bio.cam.ac.uk
Wed Jul 31 12:36:50 EST 1996


Comments interspersed....

michaelw at cs.su.oz.au (Michael Wise) writes:
>For a project I currently have underway, I require
>a protein database which is not only non-redundant in
>the sense (say) of OWL, but also in which any pair of proteins
>is no more than N% similar (say 75%). The only thing I can
>think of is HSSP from EBI, but this is a small database of
>sequences with known structure (i.e. subset of PDB).

>To generate such a database is not pretty (just scoring
>all possible matches - O(n^2) for ~200,000 proteins
>is a large number, before you try to attempt to find the
>largest such database by removing the nodes/sequences
>edges/matches between remaining nodes are less than N%
>(NP-complete, I'm pretty sure)

Finding the largest such database is indeed NP-complete.  There are a
variety of greedy and non-greedy approaches which can come up with
some approximation.  (Finding the largest set is analagous to vertex
cover.)

>I think I have a greedy algorithm that will generate
>A database of the sort required, but for this I need
>a way of generating %match values. Which brings me
>to the alternate request: does someone have a free-standing
>Smith Waterman implementation that I can be readily
>integrated into a program to generate a N% Similar
>database. Ideally what I'm looking for is a C source
>module that, given two char* strings or file pointers,
>returns a floating point real.

You do not want to use Smith-Waterman for this.  That would take on
the order of 200 CPU-years.  Since you're only looking to exlude
relatives with extremely high (e.g. >75%) sequence identity, you can
use BLAST with special parameters to make it less sensitive (and hence
faster).  That way, it may take about 2 CPU-years on a fast machine.
You could then postprocess the BLAST output to get the %id.

On the other hand, you should bear in mind that if you use pure
%identity, you'll be throwing away most of the information in the
sequences.  P-values (and to a lesser extent bits / SW-scores) are
much, much more informative.

-- 
Steven E. Brenner                    | S.E.Brenner at bioc.cam.ac.uk 
MRC Laboratory of Molecular Biology  | 
Hills Road                           | Office:   +44 1223 248011
Cambridge CB2 2QH, UK                | Fax:      +44 1223 213556




More information about the Bio-soft mailing list