Massively Parallel Applications in Sequence Analysis

Bill Pearson wrp at cyclops.micr.Virginia.EDU
Mon Mar 29 13:49:12 EST 1993


In article <MacMS.25746.31051.brutlag at cmgm.stanford.edu> brutlag at CMGM.STANFORD.EDU (Douglas Brutlag) writes:
>Bill,
>
>    Isn't FASTA with optimization identical to the Smith-Waterman?  The
>optimization step in FASTDB is precisely a Smith-Waterman scoring of the top
>5,000 sequences, and hence FASTDB with optimization is a Smith-Waterman
>analysis on those sequences. ...

	No, FASTA uses a band of 32 residues for optimization.
Smith-Waterman uses both sequences in their entirety for the
optimization.  FASTA with ktup=1 and optimization is about 5 to 10
times faster than Smith-Waterman, reflecting the fact that the average
query sequence is about 150 to 300 residues long, so a 32-residue band
covers only about a fifth to a tenth of the full comparison.  With
FASTA, you can either optimize every sequence or optimize only those
with a score greater than a threshold - either method works as well as
Smith-Waterman.
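
To make the band-versus-full distinction concrete, here is a minimal
Python sketch. It is not FASTA's actual code. The function name
sw_score, the match/mismatch/gap values, the linear gap penalty, and
the way the band is described (a diagonal offset plus a half-width)
are all assumptions made for illustration. With diagonal=None the
whole matrix is filled, as in Smith-Waterman; otherwise only cells
near the chosen diagonal are scored.

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-2, diagonal=None, half_width=16):
    """Best local-alignment score of a against b, plus the number of cells filled.

    diagonal=None fills the whole matrix (full Smith-Waterman).
    Otherwise only cells with |(j - i) - diagonal| <= half_width are scored.
    The scoring values and the linear gap penalty are illustrative assumptions.
    """
    best = 0
    cells = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        if diagonal is None:
            lo, hi = 1, len(b)
        else:
            lo = max(1, i + diagonal - half_width)
            hi = min(len(b), i + diagonal + half_width)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Cells outside the band stay at 0, so they can never raise a score.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
            cells += 1
        prev = cur
    return best, cells


# Hypothetical toy sequences: the match lies 8 positions off the main diagonal.
query, target = "ACGT", "TTTTTTTTACGT"
full_score, full_cells = sw_score(query, target)
band_score, band_cells = sw_score(query, target, diagonal=8, half_width=1)
```

In this toy case the band centered on the right diagonal recovers the
full score while filling far fewer cells. A band centered on the wrong
diagonal would miss the alignment, which is why the two methods are not
identical in general.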

	Regarding the gold-standard - I work with as many
superfamilies as I can find, with several members of the superfamily
(some randomly chosen), and I do comparisons with Smith-Waterman.
Since I am trying to find sequences that share a common ancestor (and
thus have a common structure), I think false-negatives are exactly
that.  There is little evidence for common structural motifs that can
be recognized by sequence comparison in the absence of a common
ancestor.  Most recently, I have moved from a "criterion" that is a
fixed function of the scores of the top-scoring unrelated sequences
(the Genomics paper) to one that balances the number of high-scoring
unrelated and low-scoring related sequences. This gives the same
results, but seems esthetically more pleasing.
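
One possible reading of a criterion that "balances" the two kinds of
error can be sketched in Python as follows. This is only an
illustration, not the procedure actually used for these evaluations:
the function name balanced_cutoff, the choice to compare counts at
each observed score, and the toy scores are all assumptions. The
sketch picks the score cutoff at which the number of unrelated
sequences scoring at or above the cutoff is as close as possible to
the number of related sequences scoring below it.

```python
def balanced_cutoff(related, unrelated):
    """Return (cutoff, related_below, unrelated_at_or_above).

    The cutoff is the observed score at which the two counts are closest.
    Ties go to the highest such score. Illustrative assumption only.
    """
    candidates = sorted(set(related) | set(unrelated), reverse=True)
    if not candidates:
        raise ValueError("need at least one score")
    best = None
    for c in candidates:
        unrelated_above = sum(1 for s in unrelated if s >= c)
        related_below = sum(1 for s in related if s < c)
        diff = abs(unrelated_above - related_below)
        if best is None or diff < best[0]:
            best = (diff, c, related_below, unrelated_above)
    return best[1:]


# Hypothetical similarity scores, invented for this example.
related_scores = [120, 95, 80, 60, 40]
unrelated_scores = [70, 55, 50, 30, 20]
result = balanced_cutoff(related_scores, unrelated_scores)
```

With these made-up scores the balance point falls at a cutoff of 60,
where one related sequence is missed and one unrelated sequence scores
above the cutoff.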

	I feel pretty uncomfortable with "motifs" that result from
convergence.  I prefer to focus on common ancestry.  For me, that
solves many of the problems you mention.

Bill



