? tool to remove redundancy from sequence set
Geoff Barton
gjb at bioch.ox.ac.uk
Wed Jul 3 10:07:39 EST 1996
Steven Brenner wrote:
>
> Arlin Stoltzfus <arlin at is.dal.ca> writes:
> >Does anyone know of a tool for pruning a set of sequences for
> >redundancy? Something that, given a set of sequences (say,
> >in a FASTA archive) and a user-defined value of X, would
> >return a subset of sequences such that no two sequences
> >were more than X% identical.
>
> There are two subtle issues which are involved here:
>
> * There are many, many ways of producing such a set. You could try to
> produce the largest such set, the smallest such set, the best
> (depending upon some quality measure) set. Which one you want depends
> upon your ultimate use of the data.
>
> * If the percent identity is not very large (e.g., maybe 50%), identity
> is a very poor measure of propinquity. You'd do far better using
> either Smith-Waterman score or (better yet) a statistical measure.
>
As Steve points out, this is not a straightforward problem to solve.
In addition to the complications that Steve mentions, you need to
worry about the lengths of your proteins. Two sequences may share
a common domain or module, but otherwise be different. If you use
a simple identity cutoff, you will be throwing away non-redundant
information from the regions the two sequences do not share. One
solution is to treat a pair as redundant only if N% of their
residues are in common, as well as the pair being X% similar.
However, if you do this, you will still have some redundancy in
the final set, since pairs that share only a domain are both
kept. If you need a non-redundant
set for statistical analysis, then you could mask out the common
part from one of each pair. If you need non-redundancy to speed a
database search, then redundancy of this type would probably not be
a problem.
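
As a rough illustration of the X%/N% rule, here is a minimal Python
sketch. It is only one of the many possible ways of picking a set
(greedy, longest sequence first), and everything in it is an
assumption made for the example: the names prune and is_redundant,
the made-up lengths and hits, and the choice to measure N% as the
aligned length relative to the longer sequence of the pair. The
pairwise identities are assumed to come from whatever alignment
program you already use.

```python
def is_redundant(a, b, lengths, hits, max_identity, min_coverage):
    """True if a and b exceed X% identity over at least N% of the longer one."""
    hit = hits.get(frozenset((a, b)))
    if hit is None:
        return False  # no alignment reported for this pair
    identity, aligned = hit
    # Assumption: coverage is measured against the longer sequence.
    coverage = 100.0 * aligned / max(lengths[a], lengths[b])
    return identity > max_identity and coverage >= min_coverage


def prune(lengths, hits, max_identity, min_coverage):
    """Greedy filter: visit sequences longest first and keep each one
    unless it is redundant with a sequence that has already been kept."""
    kept = []
    for sid in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if not any(is_redundant(sid, k, lengths, hits,
                                max_identity, min_coverage) for k in kept):
            kept.append(sid)
    return kept


# Hypothetical input: sequence lengths, and for each aligned pair
# (percent identity, aligned length in residues).
lengths = {"A": 300, "B": 290, "C": 120}
hits = {
    frozenset(("A", "B")): (95.0, 285),  # near-duplicates over full length
    frozenset(("A", "C")): (98.0, 110),  # share one domain only
}

print(prune(lengths, hits, max_identity=90, min_coverage=80))
print(prune(lengths, hits, max_identity=90, min_coverage=0))
```

With N = 80, C is kept even though its domain is 98% identical to
part of A; with N = 0 (a simple cutoff) it is discarded along with
its non-shared residues. A different visiting order would give a
different, equally valid set.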
Geoff.
