Determining/Removing Similar Sequences

Alan Williams Alan at
Mon Dec 14 17:37:20 EST 1998

I am analyzing the prevalence and position of short sequence
motifs in large sequence datasets built from the GenBank flat
files.  The particular areas that I am interested in are
non-coding regions proximal to coding regions (i.e. NTR & UTR).
How are others handling similar sequences in the data set?
My guess is that sets of sequences with greater than X% identity
should be reduced to one sequence to prevent biasing the
statistical analysis.  So some specific questions:

(1)  When dealing with 1000 to 20,000 sequences is it necessary to remove
     nearly identical sequences? In your experience does it make a 
     difference, or would just reporting the degree of near identity in
     the dataset be sufficient?
(2)  How would you go about determining the degree of nearly identical
     sequences in a dataset? (To report along with the analysis.)
(3)  What would a good cutoff value be for defining "nearly identical"?
(4)  What software is freely available to do this sort of determination
     of near identity and pruning?
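
To make questions (2) and (4) concrete, here is a minimal Python sketch of
the kind of pruning I have in mind: keep a sequence only if it falls below
the cutoff against everything already kept. It is not taken from any
existing package. The function names (identity, prune_redundant,
redundancy_fraction) and the 0.9 default cutoff are illustrative
assumptions, and difflib's similarity ratio is only a rough stand-in for
percent identity computed from a real pairwise alignment.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Approximate fractional identity (0.0 to 1.0) between two sequences.

    Assumption: difflib's matching-block ratio stands in for identity
    derived from a proper pairwise alignment.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def prune_redundant(seqs, cutoff=0.9):
    """Greedy pruning of a {name: sequence} dict.

    Sequences are visited longest first; one is kept only if its identity
    to every previously kept sequence is below `cutoff`.
    Returns the names of the kept representatives.
    """
    kept = []
    for name, seq in sorted(seqs.items(), key=lambda kv: -len(kv[1])):
        if all(identity(seq, seqs[k]) < cutoff for k in kept):
            kept.append(name)
    return kept


def redundancy_fraction(seqs, cutoff=0.9):
    """Fraction of sequences that pruning at `cutoff` would remove."""
    kept = prune_redundant(seqs, cutoff)
    return 1.0 - len(kept) / len(seqs)
```

The fraction returned by redundancy_fraction is one possible number to
report alongside the analysis. Note that the pairwise comparison is
quadratic in the number of kept sequences, so at 20,000 sequences a
faster prefilter or a dedicated tool would probably be needed.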


Alan Williams           (finger alan at for pgp public key)
University of California, Riverside   "Where observation is concerned,
Dept. of Botany and Plant Sciences     chance favors the prepared mind."  
Alan at                               -- Louis Pasteur
