clustering/distances for unrelated sequences

Thomas Isenbarger isen at
Wed Dec 8 15:23:08 EST 2004

I want to use some sort of clustering method (multidimensional scaling 
or stochastic proximity embedding, for instance) to group nucleotide 
sequences into clouds of similar sequences on a 2-D plot.  These methods 
require a "dissimilarity matrix", which as far as I can tell is the same 
as a distance matrix (high scores mean less similarity).
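
One caveat on terminology, sketched in Python below. A dissimilarity matrix is usually only assumed to be symmetric and non-negative with a zero diagonal, whereas a distance matrix in the strict (metric) sense also obeys the triangle inequality. The function names and the example matrix are hypothetical, chosen only for illustration.

```python
def is_dissimilarity_matrix(m, tol=1e-9):
    """True if m is square, symmetric, non-negative, with a zero diagonal."""
    n = len(m)
    # Check the shape first so the index lookups below are always safe.
    if any(len(row) != n for row in m):
        return False
    for i in range(n):
        if abs(m[i][i]) > tol:
            return False
        for j in range(n):
            if m[i][j] < 0 or abs(m[i][j] - m[j][i]) > tol:
                return False
    return True


def satisfies_triangle_inequality(m, tol=1e-9):
    """True if m[i][k] <= m[i][j] + m[j][k] for every triple (i, j, k)."""
    n = len(m)
    return all(
        m[i][k] <= m[i][j] + m[j][k] + tol
        for i in range(n) for j in range(n) for k in range(n)
    )


# Hypothetical example: a valid dissimilarity matrix that is not a metric,
# because going 0 -> 1 -> 2 (1 + 1) is shorter than going 0 -> 2 directly (5).
example = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [5.0, 1.0, 0.0],
]
```

Here is_dissimilarity_matrix(example) holds while satisfies_triangle_inequality(example) does not. Whether that difference matters depends on the embedding method used.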

I have a set of 700+ sequences that I want to group this way, but the set:

1.  contains some homologous groups, but
2.  these groups are unrelated, and
3.  the sequences are of different lengths

If the sequences were related and could be trimmed to the same length, I 
would do an alignment and then use phylip to create a distance matrix, 
but since my sequences are unrelated and cannot really be trimmed to the 
same length, I am at a loss for what to do.

For a set with so many unrelated sequences of different lengths, the 
only thing I have been able to think of is an all-against-all BLAST to 
create a score matrix using the normalised bit score, but this gives 
high scores for similarities.  From there, the only thought I had was to 
use the reciprocal of the BLAST score as some perverse measure of distance.
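
A rough Python sketch of turning all-against-all scores into a dissimilarity matrix follows. Note that it does not use the reciprocal: a plain reciprocal is undefined for pairs with no hit (score 0), which unrelated sequences will produce. Instead it swaps in a different conversion, one minus the score scaled by the smaller of the two self-scores. This is only one possible choice, not an established recipe. The function name, the input layout (a dict keyed by query/subject pairs) and the example scores are all assumptions made for illustration.

```python
def scores_to_dissimilarity(ids, scores):
    """Build a symmetric dissimilarity matrix from pairwise bit scores.

    ids    -- list of sequence identifiers
    scores -- dict mapping (query_id, subject_id) to a bit score;
              pairs without a hit are absent, self-hits must be present
    """
    n = len(ids)
    # Each sequence's self-score is taken as the best score it can reach.
    self_score = {i: scores[(i, i)] for i in ids}
    matrix = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            qa, qb = ids[a], ids[b]
            # The two search directions may give different scores;
            # keep the larger one. Missing pairs count as score 0.
            s = max(scores.get((qa, qb), 0.0), scores.get((qb, qa), 0.0))
            # Scale by the smaller self-score and clamp to [0, 1].
            ratio = min(s / min(self_score[qa], self_score[qb]), 1.0)
            # High similarity -> low dissimilarity; no hit -> 1.0.
            matrix[a][b] = matrix[b][a] = 1.0 - ratio
    return matrix


# Made-up scores for three hypothetical sequences; seqC hits nothing else.
ids = ["seqA", "seqB", "seqC"]
scores = {
    ("seqA", "seqA"): 200.0,
    ("seqB", "seqB"): 100.0,
    ("seqC", "seqC"): 150.0,
    ("seqA", "seqB"): 50.0,
    ("seqB", "seqA"): 40.0,
}
D = scores_to_dissimilarity(ids, scores)
```

With these made-up numbers, seqA and seqB end up at dissimilarity 0.5, and seqC sits at the maximum of 1.0 from both. All unrelated pairs therefore get the same value, so the matrix carries no information about how the unrelated groups should be placed relative to one another.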

Any ideas?

please email to isen AT plantpath DOT wisc DOT edu

Tom Isenbarger

