? tool to remove redundancy from sequence set
gjb at bioch.ox.ac.uk
Wed Jul 3 10:07:39 EST 1996
Steven Brenner wrote:
> Arlin Stoltzfus <arlin at is.dal.ca> writes:
> >Does anyone know of a tool for pruning a set of sequences for
> >redundancy? Something that, given a set of sequences (say,
> >in a FASTA archive) and a user-defined value of X, would
> >return a subset of sequences such that no two sequences
> >were more than X% identical.
> There are two subtle issues which are involved here:
> * There are many, many ways of producing such a set. You could try to
> produce the largest such set, the smallest such set, the best
> (depending upon some quality measure) set. Which one you want depends
> upon your ultimate use of the data.
> * If the percent identity is not very large (e.g., maybe 50%), identity
> is a very poor measure of propinquity. You'd do far better using
> either a Smith-Waterman score or (better yet) a statistical measure.
As Steve points out, this is not a straightforward problem to solve.
In addition to the complications that Steve mentions, you need to
worry about the lengths of your proteins. Two sequences may share
a common domain or module, but otherwise be different. If you use
a simple cutoff, you will be throwing away non-redundant
information. A solution to this is to require pairs to have
N% of residues in common as well as being X% similar.
However, if you do this, you will still
have some redundancy in the final set. If you need a non-redundant
set for statistical analysis, then you could mask out the common
part from one of each pair. If you need non-redundancy to speed a
database search, then redundancy of this type would probably not be
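
A minimal Python sketch of the two-threshold idea above may make it more
concrete. It is only one of the many possible ways of choosing a subset
(a greedy pass, longest sequence first), and it makes several assumptions
that are not part of the discussion: the sequences are already aligned to
each other (equal-length rows, with '-' for gaps), identity is counted over
columns where both sequences have a residue, and the "N% of residues in
common" is measured against the longer sequence of each pair. The function
names and parameters are made up for illustration.

```python
def identity_and_coverage(a, b):
    """Return (percent identity, percent coverage) for two aligned rows.

    Assumptions: a and b are rows of the same alignment, '-' marks a gap.
    Identity is taken over columns where both rows have a residue;
    coverage is that column count relative to the longer sequence.
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    aligned = sum(1 for x, y in zip(a, b) if x != "-" and y != "-")
    same = sum(1 for x, y in zip(a, b) if x != "-" and x == y)
    longer = max(len(a.replace("-", "")), len(b.replace("-", "")))
    identity = 100.0 * same / aligned if aligned else 0.0
    coverage = 100.0 * aligned / longer if longer else 0.0
    return identity, coverage


def prune(rows, max_identity, min_coverage):
    """Greedy pruning: visit sequences longest first and keep one only if
    no already-kept sequence is both more than max_identity percent
    identical to it and aligned over at least min_coverage percent."""
    order = sorted(rows, key=lambda name: -len(rows[name].replace("-", "")))
    kept = []
    for name in order:
        redundant = False
        for other in kept:
            ident, cov = identity_and_coverage(rows[name], rows[other])
            if ident > max_identity and cov >= min_coverage:
                redundant = True
                break
        if not redundant:
            kept.append(name)
    return kept


# Toy data: b is a near-copy of a; c shares only a short segment with a.
rows = {
    "a": "MKTAYIAKQR------",
    "b": "MKTAYIAKQK------",
    "c": "------AKQRGHWLPF",
}
with_coverage = prune(rows, max_identity=80, min_coverage=80)  # ['a', 'c']
simple_cutoff = prune(rows, max_identity=80, min_coverage=0)   # ['a']
```

With the coverage requirement, c survives because it matches a over only
40% of its length; with a simple identity cutoff it is discarded along with
b, which is the loss of non-redundant information described above. The
shared segment between a and c is still present in the result, so it would
need masking if the set is used for statistics.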
Geoffrey J. Barton, Laboratory of Molecular Biophysics, University of
Rex Richards Building, South Parks Road, Oxford OX1 3QU, U.K.
mailto:gjb at bioch.ox.ac.uk, Tel: +44 1865 275368, Fax: +44 1865 510454,