? tool to remove redundancy from sequence set

Arlin Stoltzfus arlin at is.dal.ca
Wed Jul 3 19:13:37 EST 1996

Geoff Barton wrote:
> Steven Brenner wrote:
> > There are two subtle issues which are involved here:
> As Steve points out, this is not a straightforward problem to solve.

[subtle and complicated issues deleted]

These considerations are valid, in principle, but don't apply to
my problem.  I'm just looking at thousands of splice junction sequences 
with the intention of analyzing informational signals, and I 
don't want the results to be biased by large sets of nearly 
identical sequences (e.g., human antibody genes).  All of the
sequence fragments are exactly the same length, there are 
no gaps, and a simple measurement of nucleotide identity is 
sufficient to quantify the relationship between any two
sequences (unless I want to take into account base composition).
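
For equal-length, ungapped fragments like these, the identity measure
and a redundancy filter built on it can be sketched in a few lines.
This is only an illustration under stated assumptions, not the
algorithm used by any particular program: the function names
(identity, filter_redundant) and the 0.9 cutoff are hypothetical, the
filter is a simple greedy one, and base composition is ignored.

```python
def identity(a, b):
    """Fraction of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def filter_redundant(seqs, threshold=0.9):
    """Greedy filter (an assumed approach): keep a sequence only if its
    identity to every previously kept sequence is below the threshold."""
    kept = []
    for s in seqs:
        if all(identity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


if __name__ == "__main__":
    # Made-up fragments, for illustration only.
    fragments = ["AAGGTAAGTA", "AAGGTAAGTC", "CAGGTGAGTT", "TTCAGGTCCA"]
    for seq in filter_redundant(fragments, threshold=0.9):
        print(seq)
```

A greedy filter of this kind depends on input order (the first member
of a near-identical group is the one that survives) and compares each
candidate against every sequence kept so far, which should be
tolerable for a few thousand short fragments.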

One possible solution is the CLEANUP program (brought to my 
attention by Sabino Liuni, one of the developers):

  Grillo, et al. 1996, CABIOS 12(1):1, CLEANUP: a fast computer 
  program for removing redundancies from nucleotide sequence databases.

which is dependent on GCG and NCBI libraries (I don't have the
GCG libraries, yet, but I'm working on it).

Arlin Stoltzfus
Department of Biochemistry
Dalhousie University
Halifax, Nova Scotia B3H 4H7 CANADA
(email) arlin at is.dal.ca 
(phone) 902-494-3569 
(fax) 902-494-1355
