Distributing BLAST jobs

DEMICHEL Patrick demichelpatrick at qwest.net
Thu Mar 7 04:50:58 EST 2002


Hi,

<lmj at pasteur.fr> wrote in message news:a64rcd$8ko$1 at desdemone.pasteur.fr...
> In article <3C84F548.7050800 at purdue.edu>,
> Rick Westerman  <westerman at purdue.edu> wrote:
> >     This should be a common task so I suspect that someone has done it
> >before but I can not find a reference so any help is appreciated.
> >
> >      What I want to do is to distribute BLAST search requests to
> >multiple machines that are not hooked together in a unified way.
> >Basically:
> >
> >1) End user, via a web screen, says something like "I want to run blastx
> >through PIR on these 300 sequences."
>
> I had to do this for 30000 protein sequences against the NCBI nr protein
> database. And it was repeated every month..

   Do you have a performance analysis?
   Were the NFS I/Os or the network a bottleneck?

>
> >
> >2) Program 'X' takes the sequences and distributes them to computers
> >'A', 'B', and 'C' all of which have the blast program installed and the
> >databases installed locally.  Said computers could be a Condor-cluster,
> >MP machines, or other. All Unix-based though.
>
> The program 'X' in my place was 'ppmake' which uses 'PVM' (Parallel
> Virtual Machine URL:<http://www.csm.ornl.gov/pvm/pvm_home.html>). All
> machines in the pvm shared the directory with the sequences. The makefile
> was set up with all the targets as the blast output files with a dependency
> on the sequence file.
>
> All you have to do is start up PVM on the client machines and type ppmake.
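
If I understand the setup, the makefile has one blast output target per
sequence file. A minimal sketch of a script that would generate such a
makefile (the seqs/ and out/ layout and the blastall options are my
assumptions, not Louis's actual setup):

    import glob, os

    # one blast output target per sequence file, as described above
    rules = []
    outputs = []
    for seq in glob.glob("seqs/*.fa"):
        out = os.path.join("out", os.path.basename(seq) + ".blastx")
        outputs.append(out)
        rules.append("%s: %s\n\tblastall -p blastx -d PIR -i %s -o %s\n\n"
                     % (out, seq, seq, out))

    f = open("Makefile.blast", "w")
    f.write("all: %s\n\n" % " ".join(outputs))
    f.writelines(rules)
    f.close()

ppmake then builds the missing targets in parallel across the PVM, and a
rerun only recomputes the outputs whose sequence file changed.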
>
>
> >
> >3) Program 'Y' picks up the results from the computers and gives them
> >back to the end user.
>
> If the machines share the same directory, no need to transfer the files
> back to the originating machine.

I imagine you are thinking of NFS; the problem is the number of clients and
the volume of data you have to read or write versus the compute time.
I imagine you have multiple runs per database per machine, so at worst you
only make one transfer per machine. If you have a large cache, maybe a big
part of the database can be cached, but the best is probably a transfer
before the run. If the databases are relatively static, that is the best
strategy.

To make the best choice you need to know these numbers (a rough estimate
from them is sketched below):
     the number of machines
     the volume of NFS I/O per run
     the average CPU time for one run
     the number of different databases you read per run
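
For example, a back-of-the-envelope comparison of transfer time against
compute time; all numbers below are placeholders to replace with your own
measurements:

    # placeholders: replace with your measurements
    n_machines  = 10        # machines running blast
    io_per_run  = 500e6     # bytes read over NFS per run
    cpu_per_run = 1800.0    # seconds of CPU per run
    net_rate    = 1e6       # usable bytes/second of shared bandwidth

    # if feeding every machine once takes longer than one run computes,
    # the network is the bottleneck and local database copies win
    transfer_time = n_machines * io_per_run / net_rate
    if transfer_time > cpu_per_run:
        print("network-bound: copy the databases to local disks first")
    else:
        print("CPU-bound: reading over NFS is probably fine")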

Whatever you choose, I think it is relatively easy to use a client-server
approach for this kind of job. On the server, you start a server process
that reads the list of requests and starts a client on every machine that
appears in the list of machines.

The client starts, requests one task from the server, returns the result
files, then requests a new task. I am surprised that this kind of simple
tool does not exist, because in fact it is not specific to your problem but
to a much larger class of problems with the property "large compute time
for few I/O transfers".
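
A minimal sketch of that loop in Python, assuming plain TCP sockets and a
one-line protocol of my own invention (nothing standard); each command is
assumed to write its own result file, for example on NFS:

    import os, socket

    PORT = 9999                # arbitrary choice

    def serve(commands):
        # hand out one command per connection; an empty line means
        # "no more work" and tells the client to exit
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", PORT))
        s.listen(5)
        pending = list(commands)
        while 1:
            conn, addr = s.accept()
            conn.recv(64)                      # client says "GET"
            if pending:
                conn.sendall((pending.pop(0) + "\n").encode())
            else:
                conn.sendall("\n".encode())
            conn.close()

    def work(server):
        # ask the server for tasks until none are left
        while 1:
            c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            c.connect((server, PORT))
            c.sendall("GET".encode())
            cmd = c.makefile().readline().strip()
            c.close()
            if not cmd:
                break
            os.system(cmd)                     # e.g. one blastall command

The server in this sketch never exits; kill it once all the result files
are there. A real version would also want timeouts and a retry of the tasks
from clients that died.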

I have already written this kind of code for distributing a numerical
computation on heterogeneous machines. The code is extremely simple and
could probably be adapted to your problem quite easily if you show me in
more detail the kind of requests you generate.

What I imagine:
    you create a file with the IP addresses of all the machines where the
code is installed
    you create a file with one line per command you want to execute
    you start the command: distribute cmd_file machines_file
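
Concretely, the two files and the invocation could look like this (the
'distribute' command is hypothetical: it would be the server sketched above
plus a loop that starts one client per machine):

    $ cat machines_file
    192.168.1.10
    192.168.1.11
    192.168.1.12

    $ cat cmd_file
    blastall -p blastx -d PIR -i seq001.fa -o seq001.out
    blastall -p blastx -d PIR -i seq002.fa -o seq002.out
    ...

    $ distribute cmd_file machines_file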

I can even imagine a simpler approach using a script with remsh, if the
management of requests is uniform and always uses the same program. If your
tasks are very short, the client-server approach will be better.
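
A sketch of that simpler variant, pushing a fixed slice of the commands to
each machine up front (static scheduling, like Louis's choice); remsh could
equally be rsh or ssh, and the file names are the ones imagined above:

    import os

    machines = [l.strip() for l in open("machines_file") if l.strip()]
    commands = [l.strip() for l in open("cmd_file") if l.strip()]

    for i, m in enumerate(machines):
        # machine i gets every len(machines)-th command
        batch = "; ".join(commands[i::len(machines)])
        os.system('remsh %s "%s" > log.%s 2>&1 &' % (m, batch, m))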

Explain to me what you need to exchange. I imagine result files: here maybe
NFS can be useful, because you will automatically have all the results
gathered in one place. What I did in my program was transmit the files with
a very simple protocol. You can add any kind of checksum or control/logging
if your job is critical.
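
For example, a minimal integrity check on top of such a transfer; the
name-size-MD5 header is my own trivial framing, not a standard protocol:

    import hashlib

    def send_file(sock, path):
        # header line: name, size, md5 -- then the raw bytes
        data = open(path, "rb").read()
        digest = hashlib.md5(data).hexdigest()
        sock.sendall(("%s %d %s\n" % (path, len(data), digest)).encode())
        sock.sendall(data)

    def recv_file(f):
        # f = sock.makefile("rb"); verifies the checksum on arrival
        name, size, digest = f.readline().split()
        data = f.read(int(size))
        if hashlib.md5(data).hexdigest() != digest.decode():
            raise IOError("checksum mismatch for %s" % name.decode())
        return name.decode(), data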

>
> >
> >     What I want to find are programs 'X' and 'Y'.
> >
> >     One could extend this idea to Fasta searches, PFAM searches, etc.
> > Undoubtedly there are several ways to implement this; I'm not too picky
> >on how it is done.  I'm sure one of the bigger sequencing institutions
> >has something like this but I can not seem to find the 'X' and 'Y' program.
>
> This would work quite easily for other programs that can be controlled by
> a makefile.
>
> Setting up PVM was the most time-consuming part but was easy to do. I
> installed PVM on several DEC Alphas, Sun Solaris, and SGI machines (which
> means that I installed blastall on these machines, too). I opted to
> use static scheduling (instead of the dynamic scheduling of PVM).

Can you explain why you chose static scheduling over dynamic?

>
> If you like, I have scripts for starting PVM on the clients, syncing the
> blast database files, and running the ppmake that I can send you if you
> decide to go this route.
>
>
> >
> >Thanks in advance,
> >-- Rick
> >westerman at purdue.edu
> >
>
> Hope this helps,
>
> --Louis
>
> --
> Louis Jones                             /\
> Institut Pasteur                o      /  \
> 28, rue du Dr. Roux            /<(*)/\/    \
>
Patrick.
