Parallel Processors/Algorithms (Summary)

A. Parsons mbpcr at s-crim1.dl.ac.uk
Thu May 6 06:03:24 EST 1993


NetFolk,

Well, the threads on my original post about parallel architectures in database
similarity searching have all but dried up.

Here is the summary (LONG = 1500 lines), including some related threads and
earlier correspondence I had with other folk.  It also features the
controversial IntelliGenetics survey on BLAZE.

My overall opinion is that people generally feel that using an implementation
of the S&W algorithm:

(a) is valuable;
(b) delivers better-quality results;
(c) is only really practicable in a realistic timeframe on massively parallel
    hardware (MPsrch and BLAZE).

This is only my impression; others may come away with different ones.  PVM is
an interesting idea, but given a greenfield site (we don't have lots of
networked Unix workstations, much less any with spare cycles) I think that
massively parallel is the way to go.  The speed aspect may be important, but I
think it is the quality argument that wins the day (see Chris Upton's posts).

Hope you find this valuable/useful/helpful and not too much of a waste of
network bandwidth.

Many thanks to all who replied!

Tony Parsons

 +----------------------------------------------------------------------------+
 | Dr.Tony Parsons,			VOX   : + 44 304 616871	              |
 | Information Technology,		FAX   : + 44 304 616670               |
 | Pfizer Central Research,				 		      |
 | Ramsgate Road,  +----------------------   e-mail   ------------------------+
 | Sandwich,       |JANET      : parsons_a at uk.co.pcr.snd01                    |
 | KENT	           |Internet   : parsons_a at snd01.pcr.co.uk	       	      |
 | CT13 9NJ UK	   |Compuserve : 100064,765    PSImail:234216700127.parsons_a |
 +-----------------+----------------------------------------------------------+

--------------------------------------------------------------------------------

From: S S Sturrock <sss at uk.ac.ed.castle>
Early Correspondence from Shane Sturrock (author of Mpsrch)

Well, MPsrch will arrive on a MasPar at Harrow (HGMP-RC) probably in the
next week or so, they will also be evaluating BLAZE.  The EMBL installation
is waiting on good network connections and an upgrade of the OS on their
machine since that MasPar was supplied by DEC.

Performance-wise, MPsrch is about two and a bit times quicker: BLAZE manages
about 55 million cell updates a second compared with 130 million for MPsrch.
More to the point, although BLAZE runs a Smith-Waterman search on the MasPar,
the alignment reconstruction is done by FASTDB, a word-based method which is
supposedly fast but is not guaranteed to produce the same alignment; certainly
the scores would not agree.  The score reported is the one from the S/W
algorithm, while the alignment is FASTDB's.  In MPsrch the reconstruction is
also by full S/W, and it still manages to be about 10x quicker than the FASTDB
method used by BLAZE; moreover the score will agree, because the algorithm is
the same.  Further, I have introduced the concept of gaps where a gap may be
one or many indels, giving the clumping effect of affine gaps through a simple
modification to the S/W reconstruction.  Many paths may give the same score,
but it is generally best to pick the one with the fewest gaps (even though it
has the same number of indels).  We still use a single penalty, since true
affine gaps increase the complexity, though we will introduce them if anyone
feels they would be an advantage over the route I have chosen.

eg:

With DNA in particular -

Standard algorithm:

      ******        *  * * *   ************
      acttatattcctctatattggatgatgctggtctgat
      acttct--------a--t-g-a---tgctggtctgat

MPsrch:

      ********               **************
      acttatattcctctatattggatgatgctggtctgat
      acttctat---------------gatgctggtctgat

These two score identically but the second is obviously the best alignment
because it only has one gap rather than the five of the standard method.
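The tie-breaking idea above can be sketched as a small change to the
traceback: among equal-scoring predecessor cells, prefer to repeat the move
made on the previous step, so that runs of indels clump into a single gap.
This is my own minimal Python illustration, not MPsrch's actual code; the
scoring values are arbitrary assumptions:

```python
# Smith-Waterman local alignment with a traceback that, on score ties,
# extends the current gap rather than opening a new one (gap clumping).

def smith_waterman(a, b, match=1, mismatch=-1, indel=-1):
    n, m = len(a), len(b)
    # H[i][j] = best local-alignment score ending at a[:i], b[:j]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + indel, H[i][j-1] + indel)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    # Traceback from the best cell; on ties, prefer the previous move so
    # equal-scoring paths collapse their indels into fewer gaps.
    out_a, out_b = [], []
    i, j, last = bi, bj, 'D'
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
        ok = {'D': H[i][j] == diag,              # diagonal: (mis)match
              'U': H[i][j] == H[i-1][j] + indel, # up: gap in b
              'L': H[i][j] == H[i][j-1] + indel} # left: gap in a
        move = next(mv for mv in [last] + [c for c in 'DUL' if c != last]
                    if ok[mv])
        if move == 'D':
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif move == 'U':
            out_a.append(a[i-1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j-1]); j -= 1
        last = move
    return best, ''.join(reversed(out_a)), ''.join(reversed(out_b))
```

Note this greedy tie-break does not guarantee the globally minimal gap count
among all optimal paths; it is only meant to show the flavour of the trick.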
--------------------------------------------------------------------------------
From: S S Sturrock <sss at uk.ac.ed.castle>

What you may be interested in is that the smallest MasPar machine is now only
55K pounds including the workstation host.  This gives you a 1024-processor
MasPar box with MP1 processors, upgradable to 4096 processors, either MP1 or
MP2 (which would give over 26000 MIPS plus 6 gigaflops), without the need to
change anything else; just a simple board swap.  MPsrch will run on this
setup, and we have discussed the possibility of a package deal too, so that
may be something to interest you.  At these prices it seems crazy to go for
anything else.  Even on the minimum configuration you will still get 40
million cell updates a second at the moment, and it should be much more after
I come back from California.  Basically you will get better performance than
BLAZE on a far cheaper machine.  Too good to be true?
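The cell-updates-per-second figures quoted in these posts translate directly
into wall-clock estimates: a Smith-Waterman search fills one matrix cell per
query-residue/database-residue pair, so search time is roughly (query length x
database residues) / CUPS.  A quick sketch (the query and database sizes here
are my own illustrative assumptions, not figures from the posts):

```python
# Rough Smith-Waterman search time from a machine's cell-updates-per-second
# (CUPS) rating: total work = query length * database length matrix cells.

def search_seconds(query_len, db_residues, cups):
    return query_len * db_residues / cups

query = 300          # residues in the query (assumed)
db = 10_000_000      # residues in the database (assumed, early-1990s scale)
for name, cups in [("minimum MasPar (40 MCUPS)", 40e6),
                   ("BLAZE (55 MCUPS)", 55e6),
                   ("MPsrch (130 MCUPS)", 130e6)]:
    print(f"{name}: {search_seconds(query, db, cups):.0f} s")
```

Under these assumptions even the minimum configuration finishes a full-library
search in around a minute, which is why the quality argument, not raw speed,
tends to decide between the rigorous methods.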

--------------------------------------------------------------------------------
Bill Pearson (of FASTA fame) writes:

>Having read some of the literature that compares various algorithms,
>the articles tend to sort into two classes:
> 
>     1) Theoretically based papers discussing the various mathematical
>        merits of one algorithm versus another. The authors are likely to
>        have an investment in their favorite tool, and it's not always clear
>        (to me) how the math translates to the real world
> 
>     2) Very general papers which look at two or three sequences and
>        draw conclusions about algorithm strengths from a limited database.
> 
>It seems to me that all the algorithms have their unique strengths and
>weaknesses- and this is going to be strongly sequence dependent. That
>is why I made the original post, in an attempt to discern the relative
>strengths of the programs for different applications.

I (with bias, no doubt) recommend the paper:

	W. R. Pearson (1991) Genomics "Searching Protein Sequence Libraries:
	Comparison of the Sensitivity and Selectivity of the Smith-Waterman
	and FASTA Algorithms"  11:635-650

It examines the performance of FASTA, BLAST, and the Smith Waterman
algorithm (which is used by BLAZE) on about 34 different superfamilies
of proteins (all superfamilies with 20 or more members).  The bottom
line is that Smith-Waterman performs better than BLAST or FASTA, but
that FASTA will perform as well as Smith-Waterman if one optimizes all
the library scores (the -o option), and will search about 10X faster
than Smith-Waterman (though not, of course, 10X faster than
Smith-Waterman on a MasPar). A description of the -o option and the
reference to the Genomics paper are provided with the FASTA documentation.

	In the past 6 months I have extended these studies to 75
superfamilies and used more rigorous statistical tests, but the result
is the same: if you use the -o option with FASTA, you will do as well
as Smith-Waterman. If you don't, you won't.

	These comments apply only to protein sequence searching.  I do
not believe that there is any advantage to slower, more rigorous,
algorithms for DNA sequence searches.  For DNA, if a significant
similarity exists in a non-protein-coding sequence, BLAST or FASTA
will find it as well as, or better than, Smith-Waterman.  With DNA,
selectivity is much more of a problem than sensitivity and both FASTA
and BLAST are more selective.

	Since the protein sequence libraries are considerably smaller
than the DNA sequence libraries (and will continue to be, even after
entire genomes are sequenced), I believe that it is reasonable for
most researchers to rely on FASTA with the -o option for protein
library searches and on FASTA or BLAST for DNA searches.

	There is one place where the Smith-Waterman algorithm should
be used, however - in the preparation of alignments for publication.
For some proteins with large variable-length loops, FASTA will not
allow gaps of the appropriate size (and BLAST does not construct
alignments with gaps, although it often shows several aligned
regions).  I am considering modifying FASTA to display rigorous
Smith-Waterman alignments rather than the approximate alignments
(gaps < 16 residues) it currently calculates.  A Smith-Waterman
implementation (SSEARCH) is available with the FASTA distribution.

Bill Pearson

--------------------------------------------------------------------------------
Subject: Genbank BLAZE user survey summary
Cc: stamm at COM.MasPar
Sender: stamm at COM.MasPar

Thank you all for your patience.  Here is the survey summary
as promised.

Regards,
Rich Stamm
stamm at maspar.com


============================================================
		GENBANK BLAZE USER SURVEY

		  SUMMARY OF RESPONSES

The BLAZE email server was operated by IntelliGenetics from May to September
of 1992, funded in part by IntelliGenetics, MasPar, and the National
Institutes of Health.  A summary of the survey questions and responses
follows. A total of 63 people sent responses by email.

1)  When you first used BLAZE did you have any previous experience
    searching sequence databases with the Smith-Waterman algorithm?

Previous experience reported:

SMITH-WATERMAN 	=  7
FASTA 		= 27
BLAST 		= 20
WORDSEARCH	=  2
GAP 		=  1
FASTDB 		=  1
EMBL/ARGOS 	=  1

    If so, what were your impressions of the algorithm?

Impressions reported 

FAVORABLE = 29 
(mentioned are speed, accuracy, sensitivity, ease of use, 
annotations, exploratory nature)

OK = 3

UNFAVORABLE = 3 
(unfavorable experiences were due to problems with email, 
software bugs, missing features)


2)  Did you achieve noticeably better sensitivity using BLAZE over
    other search algorithms?  Which ones (FASTA, BLAST, FASTDB...)?

Reports of increased sensitivity

OVER FASTA 		= 15
OVER BLAST 		= 13
OVER FASTDB		=  1
OVER WORDSEARCH		=  1
NON-SPECIFIC INCREASE OBSERVED	=  7
NO ASSESSMENT DONE			= 11
SENSITIVITY INCREASE NOT OBSERVED	= 13*

*7 OF THESE 13 RESPONDEES REPORTED THAT THEIR QUERY SEQUENCES 
EITHER HAD MANY STRONG MATCHES OR (NEARLY) NO MATCHES AT ALL IN 
THE DATABASE.


3)  Did you explore the parameter space to attempt to improve the
    sensitivity of your BLAZE searches (i.e. vary the PAM matrix, 
    gap penalty and gap extension penalty)?  If so was the improvement 
    from your parameter search noticeable?

EXPLORED PARAMETER SPACE		= 9
FOUND NOTICEABLE IMPROVEMENTS		= 5
FOUND NO SIGNIFICANT DIFFERENCES	= 3*

*2 OF THE 3 RESPONDEES WHO REPORTED NO SIGNIFICANT DIFFERENCES FROM VARYING
THE SEARCH


