I want to use some sort of clustering method (multidimensional scaling
or stochastic proximity embedding, for instance) to group nucleotide
sequences into clouds of similar sequences on a 2-D plot. These methods
require a "dissimilarity matrix", which as far as I can tell is the same
as a distance matrix (high scores mean less similarity).
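
As a rough illustration of how such a method consumes the matrix, here is a minimal pure-Python sketch of classical (Torgerson) MDS. It is a sketch under stated assumptions, not a reference implementation: the name classical_mds and its parameters are made up for illustration, the input is assumed to be a symmetric matrix with a zero diagonal, and the eigenvectors are found by a simple shifted power iteration, which would be slow for 700+ sequences (a numerical library would be the practical choice there).

```python
import math
import random


def classical_mds(dist, dims=2, iters=500, seed=0):
    """Return one `dims`-dimensional point per row of the dissimilarity matrix.

    Hypothetical helper: `dist` is assumed symmetric with zeros on the diagonal.
    """
    n = len(dist)
    # Double-centre the squared dissimilarities: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    # Gershgorin bound on |eigenvalues|; adding it to the diagonal makes power
    # iteration converge to the largest (most positive) eigenvalue of B.
    shift = max(sum(abs(x) for x in row) for row in b)
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        start = math.sqrt(sum(x * x for x in v))
        v = [x / start for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) + shift * v[i]
                 for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue of B itself (without the shift).
        lam = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        if lam <= 1e-12:
            break  # no positive eigenvalue left; remaining axes stay at zero
        for i in range(n):
            coords[i][k] = math.sqrt(lam) * v[i]
        # Deflate so the next pass finds the next eigenpair.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords
```

Each returned row is an (x, y) position for one sequence, so similar sequences should end up as nearby points on the plot. Dissimilarities that are not Euclidean distances can only be reproduced approximately.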
I have a set of 700+ sequences that I want to group this way, but the
set:
1. contains some homologous groups, but
2. these groups are unrelated to one another, and
3. the sequences are of different lengths.
If the sequences were related and could be trimmed to the same length, I
would do an alignment and then use PHYLIP to create a distance matrix,
but since my sequences are unrelated and cannot really be trimmed to the
same length, I am at a loss for what to do.
For a set with so many unrelated sequences of different lengths, the
only thing I have been able to think of is an all-against-all BLAST to
create a score matrix from the normalised bit scores, but that is a
similarity matrix: high scores mean more similarity, not less. From
there, the only thought I had was to use the reciprocal of the BLAST
score as some perverse measure of distance.
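
To make the idea concrete, here is a minimal Python sketch of turning all-against-all BLAST hits into a dissimilarity matrix. Several things in it are assumptions rather than anything I have settled on: the input is assumed to be 12-column tabular BLAST output (as produced by BLAST+ with -outfmt 6, bit score in the last column), the function names are made up, and instead of the reciprocal it uses 1 minus the bit score scaled by the smaller of the two self-hit scores. That variant keeps distances between 0 and 1 and gives pairs with no hit a distance of 1 rather than infinity.

```python
def best_bit_scores(lines):
    """Map (query, subject) -> highest bit score seen for that pair."""
    scores = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        # Assumed column layout: query id, subject id, ..., bit score (12th).
        key = (fields[0], fields[1])
        bits = float(fields[11])
        if bits > scores.get(key, 0.0):
            scores[key] = bits
    return scores


def scores_to_dissimilarity(ids, scores):
    """Build a symmetric matrix with 0 on the diagonal and values in [0, 1]."""
    n = len(ids)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a, b = ids[i], ids[j]
            # A-vs-B and B-vs-A scores can differ, so take the larger one.
            raw = max(scores.get((a, b), 0.0), scores.get((b, a), 0.0))
            # Scale by the smaller self-hit score; a missing self-hit is
            # treated as "no usable similarity".
            denom = min(scores.get((a, a), 0.0), scores.get((b, b), 0.0))
            sim = min(raw / denom, 1.0) if denom > 0 else 0.0
            dist[i][j] = dist[j][i] = 1.0 - sim
    return dist
```

The resulting matrix is not guaranteed to satisfy the triangle inequality, so it is a dissimilarity rather than a true metric distance.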
Any ideas?
Please email replies to isen AT plantpath DOT wisc DOT edu
Cheers,
Tom Isenbarger
---