DTASK: Workstations in parallel when searching biological databases
Geir Egil Hauge
geirha at ifi.uio.no
Fri Apr 16 06:55:27 EST 1993
Dtask is a new program package for running UNIX workstations in parallel
when searching biological databases. Currently, the Smith-Waterman
algorithm is used (about the same as in BLAZE).
As many as 96 UNIX machines have been run in parallel, at a speed of
42 million matrix cells per second.
The package is available by anonymous ftp from 'ftp.ifi.uio.no'.
It is in the directory 'molbio' and is named 'dtask11s.tar.Z'.
The size of the file is about 300 KB. Complete C-source and documentation
Below is a copy of the Introduction and Portability sections
of the doc file:
As mentioned above, dtask is a program package for running unix
workstations in parallel when comparing a biological sequence against a
library of such sequences.
This is especially useful when the sequence comparison is to be done by
a so-called dynamic programming algorithm, which may be quite slow on a
The sequence comparison program in this package, dsearch, uses such an
algorithm. It finds local optimal scores according to the Smith-Waterman
method. (See explanation in dsearch.c file).
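To illustrate what a Smith-Waterman local score is, here is a minimal
sketch in C. It is not the code in dsearch.c: the function name sw_score,
the match/mismatch values and the linear gap penalty are assumptions made
for this example, and a real protein search would use a substitution
matrix instead. Each pass through the inner loop fills one matrix cell.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical scoring parameters, chosen only for illustration. */
enum { MATCH = 2, MISMATCH = -1, GAP = -2 };

static int max2(int a, int b) { return a > b ? a : b; }

/* Best local alignment score of query q against library entry s.
 * Keeps only two rows of the matrix, so memory is O(strlen(s)).
 * Returns -1 if memory cannot be allocated. */
int sw_score(const char *q, const char *s)
{
    size_t m = strlen(q), n = strlen(s), i, j;
    int best = 0;
    int *prev = calloc(n + 1, sizeof *prev);   /* row i-1, all zeros */
    int *cur  = calloc(n + 1, sizeof *cur);    /* row i */
    int *tmp;

    if (!prev || !cur) { free(prev); free(cur); return -1; }

    for (i = 1; i <= m; i++) {
        cur[0] = 0;
        for (j = 1; j <= n; j++) {             /* one matrix cell */
            int diag = prev[j - 1] + (q[i - 1] == s[j - 1] ? MATCH : MISMATCH);
            int up   = prev[j] + GAP;          /* gap in s */
            int left = cur[j - 1] + GAP;       /* gap in q */
            int h = max2(0, max2(diag, max2(up, left)));
            cur[j] = h;
            if (h > best) best = h;            /* local optimum so far */
        }
        tmp = prev; prev = cur; cur = tmp;     /* reuse the two rows */
    }
    free(prev);
    free(cur);
    return best;
}
```

The score never drops below zero, which is what makes the alignment
local: a poor region resets the score instead of dragging it down.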
Dtask (with dsearch) has been run on as many as 96 unix workstations in
parallel. The speed was then measured to be 42 million matrix cells per
second, using an 801-element protein query sequence against
SWISS-PROT #21. (See section 1.4, test4).
This is half the speed that was achieved on a 64K Connection Machine
CM-2 in 1990, by Robert Jones, Washington Taylor IV, Xiru Zhang, Jill P.
Mesirov and Eric Lander.
Recently, Shane S. Sturrock and John F. Collins have achieved a speed of
84 million matrix cells per second on a 4096-processor MasPar MP-1 with
a query sequence of 377 elements. A peak of 130 million cells per second
can be attained with longer sequences.
Protein searches using their system can be done through the automatic
e-mail server BLITZ at EMBL-Heidelberg.DE.
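The "matrix cells per second" figures above can be related to search
time with a little arithmetic: comparing a query against a library fills
one cell per pair of query element and library element. The sketch below
assumes exactly that counting; the function names are hypothetical, and
the library size used in the usage note is a made-up round number, not
the size of SWISS-PROT.

```c
/* Cells filled when one query is compared against a whole library:
 * one cell per (query element, library element) pair. */
double matrix_cells(double query_len, double library_elements)
{
    return query_len * library_elements;
}

/* Throughput, in cells per second, of a search that took `seconds`. */
double cells_per_second(double query_len, double library_elements,
                        double seconds)
{
    return matrix_cells(query_len, library_elements) / seconds;
}

/* Estimated duration, in seconds, of a search at `rate` cells/second. */
double search_seconds(double query_len, double library_elements,
                      double rate)
{
    return matrix_cells(query_len, library_elements) / rate;
}
```

For example, an 801-element query against a hypothetical library of
10 million elements fills 8.01e9 cells, so search_seconds(801, 1e7, 42e6)
gives about 191 seconds at 42 million cells per second.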
An advantage of the dtask package is that users can write their own
sequence comparison programs using other algorithms, and easily get them
to work in conjunction with the divtask/scserv programs for running unix
workstations in parallel.
Please note that the workstations to be run in parallel must share
a common filesystem like NFS, and must also be able to do Internet socket
Dtask v1.1 can only do the following types of sequence comparison:
a) Protein query sequence against protein sequence library.
b) DNA/RNA query sequence against DNA/RNA library.
Dtask reports only ID/description, standard deviations above the mean, and
score, for at most 5000 of the best entries. The actual entries can be
extracted to a separate file, which can then be used as input to
W. R. Pearson's ssearch program (or similar) for getting alignments.
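The "standard deviation above mean" value can be read as a z-score. The
formula below is one common definition, assuming the mean and standard
deviation are taken over the scores s_1 ... s_N of all N library entries;
the text above does not say exactly how dtask computes them (for
instance, whether it divides by N or N-1).

```latex
z_i = \frac{s_i - \mu}{\sigma}, \qquad
\mu = \frac{1}{N}\sum_{j=1}^{N} s_j, \qquad
\sigma = \sqrt{\frac{1}{N}\sum_{j=1}^{N} (s_j - \mu)^2}
```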
The library must have sequence entries in Pearson/FASTA format. However,
it is quite easy to get dtask to function with libraries in other
formats. In most cases, only the mindex program needs to be modified.
(Mindex creates index files from library files).
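As an illustration of what indexing a Pearson/FASTA library can involve,
the sketch below records the byte offset of every entry, that is, every
line beginning with '>'. It is not the mindex program, whose index file
layout is not described here; the function name index_fasta and its
in-memory interface are assumptions made to keep the example short.

```c
#include <stddef.h>

/* Scan a FASTA library held in buf[0..len-1] and store the byte offset
 * of each header line (a line starting with '>') in offsets[].
 * At most `max` offsets are stored; the return value is the number of
 * entries found, which may be larger than `max`. */
size_t index_fasta(const char *buf, size_t len, size_t *offsets, size_t max)
{
    size_t pos, count = 0;
    int at_line_start = 1;              /* the buffer starts a new line */

    for (pos = 0; pos < len; pos++) {
        if (at_line_start && buf[pos] == '>') {
            if (count < max)
                offsets[count] = pos;   /* entry begins here */
            count++;
        }
        at_line_start = (buf[pos] == '\n');
    }
    return count;
}
```

With such offsets, a worker can seek directly to its share of the
entries instead of reading the library from the beginning. Supporting
another library format would mean changing only the test that recognizes
the start of an entry.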
Dtask was developed to make use of workstations when nobody else is
using them. Parameters can be set in such a way that programs stop using
a specific workstation when someone logs into it, or when other heavy
processes start on the machine.
The dtask programs should be run by some kind of administrator.
If many users on a local net of workstations had their own dtask
programs running at the same time, the performance would be quite poor.
So far, dtask has been run on the following systems and machines:
  SunOS 4.1.1    SUN 3/50, 3/60, 3/280, 3/480, 4/50, 4/75, 4/260,
  ULTRIX 4.2A    DEC 5000, DEC 3100.
  IRIX 4.0.5     SGI-INDIGO
Dtask is most likely to function on BSD derived systems, since it was
developed under SunOS and ULTRIX. Dtask v1.1 does not function on System
V derived systems that do not support BSD compatible signals.
- I would be happy to receive some feedback from those who run my programs.
Geir Egil Hauge