mpiBLAST release announcement

catfish ashlien at videotron.ca
Fri Feb 21 04:42:39 EST 2003


Aaron, 

I'm looking into different parallelized BLASTs and would be really
interested in hearing your responses to David's queries.

Cheers, 
Adam

mathog at caltech.edu (David Mathog) wrote in message news:<20030214104034.7369b190.mathog at caltech.edu>...
> On 13 Feb 2003 10:16:15 -0000
> darling at cs.wisc.edu (Aaron Darling) wrote:
> 
> > The mpiBLAST team announces the release of mpiBLAST, an MPI-based parallel
> > implementation of NCBI BLAST.  mpiBLAST is a pair of programs that replace
> > formatdb and blastall with versions that execute BLAST jobs in parallel on
> > a cluster of computers with MPI installed. 
> 
> Is your "blastall" a shell that calls regular blastall on the compute
> nodes, or did you actually manage to make the merge function coexist with
> the NCBI code?  If the latter, I'm really, really, really impressed,
> because, for those of you who have not had the joy of modifying blastall,
> let me tell you: the code is hideously complex.  I looked into that
> approach, gave up, and took the easy way out: postprocessing the blastall
> output in a separate program.  My parallel BLAST version has many more
> programs (and scripts) but otherwise sounds like it does pretty much what
> yours does.  It may be obtained here:
> 
>   ftp://saf.bio.caltech.edu/pub/software/molbio/parallelblast.tar
> 
> This version uses PVM instead of MPI to run the compute node
> jobs.  It doesn't do any message passing other than "start job" and
> "job is completed".  There is a CGI front end designed for on-site
> use; it would need minor modifications to allow off-site users.
> The merge step is not built into blastall but is carried out separately
> by "blastmerge", a program which was announced here some time ago.
> 
> There are also some patches to the BLAST distribution (2.2.3) which
> fix a few problems in rpsblast and extend the "gi" mechanism so that
> searches of nr/nt may be restricted by taxon id using only two files.
> (I.e., not a separate gi file for each taxon.)
> 
> PHI-BLAST isn't supported here.  Does yours include it?
> 
> > Because each node's segment of the database is smaller, it can
> > usually reside in the buffer cache, yielding a significant
> > speedup due to the elimination of disk I/O. 
> 
> Right.  The benefits of file caching are relevant even for folks
> with just one machine sitting on their desk.  My package may also
> be used for "serial parallelization" to take advantage of this effect.
> That is, if a researcher has a 10,000-entry query and wants to find
> its hits in the human genome on a workstation, the database typically
> won't fit into memory and the search will take forever.  This same
> search can be sped up immensely by fragmenting the database, running
> the same query to completion on each fragment, and then merging the
> results with blastmerge.  The fragmented method is faster so long as:
> 
>   number of database fragments < (uncached run time) / (cached run time)
> 
> I.e., if splitting the database 3 ways makes it small enough to stay in
> cache, and the ratio is 30, searching the fragments sequentially will
> be 10x faster than searching the entire database at once.
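> 
> In concrete terms the serial fragmented search is just a loop.  A
> minimal Python sketch, assuming the database was pre-split with
> formatdb (the file names and the blastmerge options shown are only
> illustrative):
> 
>   import subprocess
> 
>   QUERY = "query.fasta"
>   FRAGMENTS = ["nt.00", "nt.01", "nt.02"]   # pre-split database pieces
> 
>   # Run the same query to completion against each fragment; each
>   # piece is small enough to stay in the buffer cache.
>   outputs = []
>   for frag in FRAGMENTS:
>       out = frag + ".blast"
>       subprocess.run(["blastall", "-p", "blastn", "-d", frag,
>                       "-i", QUERY, "-o", out], check=True)
>       outputs.append(out)
> 
>   # Merge the per-fragment reports into a single result.
>   subprocess.run(["blastmerge", "-o", "merged.blast"] + outputs,
>                  check=True)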
> 
> There are two other advantages to fragmenting N ways.  First, if the
> database does need to load from local disk, it does so N times faster
> than it would have on a single node.  Second, and this is the primary
> reason I wrote my version, you can run databases which do not fit into
> a single node's memory.  Our 20-node cluster has 20 GB of memory.
> Subtract roughly 50 MB for the OS and 100 MB for blastall on each node
> and that leaves 17 GB to cache databases.  That's more than big enough
> to hold nt, nr, and a couple of mammalian genomes.  But 850 MB (the
> free amount on each node) isn't.  When the databases finally outgrow
> cluster memory you can add more nodes.  With unfragmented databases you
> have to add memory to each machine, and if they won't support enough
> memory, replace the lot of them with a model that does.
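> 
> Spelling out the arithmetic (a throwaway Python snippet using the
> round numbers above):
> 
>   nodes = 20
>   free_mb = 1000 - 50 - 100         # per-node RAM minus OS and blastall
>   print(free_mb)                    # 850 MB free on each node
>   print(nodes * free_mb / 1000.0)   # 17.0 GB of cluster-wide cache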
> 
> > It does not require a dedicated cluster.
> 
> I can't speak for your implementation, but here, if any other jobs run
> at the same time, their load must be exceedingly well balanced.
> Since the merge step cannot complete until all nodes finish, a CPU hog
> on just one node will bog down the whole BLAST system.  Moreover, these
> other jobs can't use too much memory either, or they'll bump the BLAST
> databases out of cache.  Since parallel BLAST itself is pretty well
> balanced, we allow two blastalls to run at once on each node, one at
> higher priority (for shorter jobs) and another at lower priority (for
> longer jobs).  No adverse interactions from doing so have shown up so
> far.
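> 
> Mechanically, that just means starting the longer job with a nice
> offset.  For instance, in Python (the command lines, file names, and
> nice values are illustrative):
> 
>   import os
>   import subprocess
> 
>   def run_blast(db, query, out, niceness):
>       # Start blastall with the given nice increment (Unix only).
>       return subprocess.Popen(
>           ["blastall", "-p", "blastp", "-d", db, "-i", query, "-o", out],
>           preexec_fn=lambda: os.nice(niceness))
> 
>   short_job = run_blast("nr.00", "short.fasta", "short.out", 0)
>   long_job = run_blast("nr.00", "long.fasta", "long.out", 10)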
> 
> There is one other minor "gotcha" for potential users of either
> parallel BLAST.  Some Linux distributions come with "locate", which
> puts "slocate.cron" into /etc/cron.daily.  Even on our 1 GB compute
> nodes, when slocate.cron runs it throws everything out of cache.  That
> was causing the first job that ran afterwards to be abnormally slow,
> since it had to reload the database fragments from local disk.
> Had the fragments been NFS-mounted the hit would have been even
> greater.  Compute nodes usually don't need reindexing that often,
> so move that file to /etc/cron.monthly.  It would probably make sense
> to take it out of cron entirely.
> 
> Regards,
>  
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> ---
