? tool to remove redundancy from sequence set

Geoff Barton gjb at bioch.ox.ac.uk
Wed Jul 3 10:07:39 EST 1996


Steven Brenner wrote:
> 
> Arlin Stoltzfus <arlin at is.dal.ca> writes:
> >Does anyone know of a tool for pruning a set of sequences for
> >redundancy?  Something that, given a set of sequences (say,
> >in a FASTA archive) and a user-defined value of X, would
> >return a subset of sequences such that no two sequences
> >were more than X% identical.
> 
> There are two subtle issues which are involved here:
> 
> * There are many, many ways of producing such a set.  You could try to
> produce the largest such set, the smallest such set, the best
> (depending upon some quality measure) set.  Which one you want depends
> upon your ultimate use of the data.
> 
> * If the percent identity is not very large (e.g., maybe 50%), identity
> is a very poor measure of propinquity.  You'd do far better using
> either a Smith-Waterman score or (better yet) a statistical measure.
> 

As Steve points out, this is not a straightforward problem to solve.
In addition to the complications that Steve mentions, you need to 
worry about the lengths of your proteins.   Two sequences may share 
a common domain or module, but otherwise be different.  If you use
a simple percent identity cutoff, you will be throwing away
non-redundant information.  One solution is to require that a pair
have N% of their residues in common, as well as being X% similar,
before treating the pair as redundant.  However, if you do this, you
will still have some redundancy in the final set, because the shared
domains remain.  If you need a non-redundant set for statistical
analysis, then you could mask out the common part from one member of
each pair.  If you need non-redundancy to speed up a database search,
then redundancy of this type would probably not be a problem.
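
As a rough illustration of the two-threshold idea, here is a minimal
Python sketch.  It is only one of the many possible ways of building
such a set (a greedy pass, longest sequence first), and everything in
it is an assumption made for the example: the function names, the
scoring values, the use of a plain Smith-Waterman local alignment with
a linear gap penalty, and the reading of "N% of residues in common" as
"the local alignment spans at least N% of both sequences".  It also
uses raw percent identity, which is a weak measure at low identity
levels.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (identical, columns, span_a, span_b): identical positions,
    alignment length, and residues of a and b inside the alignment.
    Scoring values are illustrative, not a real substitution matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            s = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            score[i][j] = s
            if s > best:
                best, best_pos = s, (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end_i, end_j = i, j
    identical = columns = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return identical, columns, end_i - i, end_j - j


def is_redundant(a, b, max_identity, min_coverage):
    """True if a and b are more than max_identity % identical over a
    local alignment that spans at least min_coverage % of both."""
    identical, columns, span_a, span_b = local_align(a, b)
    if columns == 0:
        return False
    identity = 100.0 * identical / columns
    coverage = 100.0 * min(span_a / len(a), span_b / len(b))
    return identity > max_identity and coverage >= min_coverage


def prune(seqs, max_identity, min_coverage):
    """Greedy filter over a {name: sequence} dict, longest first.

    A sequence is kept only if it is not redundant with any sequence
    already kept.  Cost is quadratic in the number of sequences.
    """
    kept = []
    for name, seq in sorted(seqs.items(), key=lambda kv: -len(kv[1])):
        if not any(is_redundant(seq, k, max_identity, min_coverage) for _, k in kept):
            kept.append((name, seq))
    return [name for name, _ in kept]
```

With min_coverage set to 0 this reduces to the simple X% cutoff, so a
sequence that shares only one domain with a kept sequence is thrown
away; with a higher min_coverage it is retained, at the price of the
residual redundancy described above.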

Geoff. 

-- 
Geoffrey J. Barton, Laboratory of Molecular Biophysics,
University of Oxford, Rex Richards Building, South Parks Road,
Oxford OX1 3QU, U.K.
mailto:gjb at bioch.ox.ac.uk, Tel: +44 1865 275368, Fax: +44 1865 510454, 
ftp://geoff.biop.ox.ac.uk, http://geoff.biop.ox.ac.uk



