? tool to remove redundancy from sequence set

Steven Brenner brenner at mole.bio.cam.ac.uk
Wed Jul 3 08:16:51 EST 1996

Arlin Stoltzfus <arlin at is.dal.ca> writes:
>Does anyone know of a tool for pruning a set of sequences for
>redundancy?  Something that, given a set of sequences (say,
>in a FASTA archive) and a user-defined value of X, would 
>return a subset of sequences such that no two sequences 
>were more than X% identical.  

There are two subtle issues which are involved here:

* There are many, many ways of producing such a set.  You could try to
produce the largest such set, the smallest such set, the best
(depending upon some quality measure) set.  Which one you want depends
upon your ultimate use of the data.

* If the percent identity is not very large (e.g., maybe 50%), identity
is a very poor measure of propinquity.  You'd do far better using
either a Smith-Waterman score or (better yet) a statistical measure.
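
To illustrate the first point, here is a minimal sketch of one simple
approach, a greedy filter.  This is only an illustration under stated
assumptions, not a definitive implementation: the names
percent_identity and greedy_prune are hypothetical, and the identity
function compares positions directly without any alignment, so it is
a toy stand-in for a real similarity measure.  The similarity function
is a parameter so that an alignment score or a statistical measure
could be swapped in.

```python
from typing import Callable, Dict, List


def percent_identity(a: str, b: str) -> float:
    """Toy measure: percent of matching positions over the shorter length.

    Assumption for illustration only: sequences are compared position by
    position with no alignment step.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


def greedy_prune(
    seqs: Dict[str, str],
    threshold: float,
    similarity: Callable[[str, str], float] = percent_identity,
) -> List[str]:
    """Keep a sequence only if it is at most `threshold` similar to
    every sequence already kept.  The result depends on input order."""
    kept: List[str] = []
    for name, seq in seqs.items():
        if all(similarity(seq, seqs[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# A and B are 60% identical, B and C are 60% identical, A and C only 20%.
A = "AAAAAATTTT"
B = "AAAAAACCCC"
C = "GGGGAACCCC"

first = greedy_prune({"A": A, "B": B, "C": C}, threshold=50.0)
second = greedy_prune({"B": B, "A": A, "C": C}, threshold=50.0)
print(first)   # ['A', 'C']
print(second)  # ['B']
```

Both results satisfy the "no two sequences more than X% identical"
condition, yet one set has two members and the other has one, simply
because the sequences were visited in a different order.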

I have code which could be used for producing some such sets.

Steve Brenner

Steven E. Brenner                    | S.E.Brenner at bioc.cam.ac.uk 
MRC Laboratory of Molecular Biology  | 
Hills Road                           | Office:   +44 1223 248011
Cambridge CB2 2QH, UK                | Fax:      +44 1223 213556
