mathog at caltech.edu
Tue Jul 27 06:50:22 EST 2004
A modified version of Sean Eddy's HMMER 2.3.2 is available here:
1. The PVM parts have been reworked so that they can use split
databases, which greatly reduces the CPU load on the master as
well as network traffic. With the original 2.3.2 PVM variant
on our 20 node Beowulf, some queries overloaded the master, resulting
in NFS, yp, and other failures on the compute nodes. This
has not yet been observed with the split variant, but I can't
say what might happen if you have 2000 compute nodes. Here
are a few example run times (split means PVM on 20 compute
nodes, otherwise on one compute node with database already
stored in disk cache):
hmmpfam of A1HU.pfa against pfam_fs: 102 seconds
hmmpfam of A1HU.pfa against (split) pfam_fs: 6 seconds
hmmsearch of Peptidase_M28 against swissprot: 719 seconds
hmmsearch of Peptidase_M28 against (split) swissprot: 40 seconds
hmmsearch of Peptidase_M28 against a 6 frame
translation of the (split) D. melanogaster genome: 405 seconds
2. Some of the code has been modified to make it run a little
faster, at least on Athlons.
3. It can now read BLAST formatted sequence databases directly
(allowing it to use the same databases as my parallelblast
or, I suspect, those that MPIBLAST uses). This is implemented
with the blastdb_api software already released. Taxonid
restriction is also supported to the extent possible; the NCBI
taxon dmp files currently assign only one taxon to each gi, even
when that gi describes multiple species.
4. A cgi script "hmmercontrol.pl" is supplied so that all the
HMMER programs may be run through the web, and most command line
options have been implemented.
Note that this one was written for our needs - the current
handling of account names and email addresses will NOT be
sufficient if you want to serve off site users, although such
changes would not be difficult to make. You will definitely
need to modify the configuration lines at the top, since much
of that information is site specific. You might also want to
give local users higher job priorities. It uses SGE but PBS
or any other queueing system should work as well.
It does not support the graphics options used in the cgi
supplied by the PFAM/HMMER folks. However, it does support
HMMPFAM searches on 6 frame translated nucleic acid databases
- sometimes slowly. This type of search takes 2 hours on
20 Athlon 2200MP nodes against a mammalian genome, at 99% cpu
usage on each node, and about 11 seconds against the E. coli genome.
5. Man pages have been modified to show the new options present in
most of the programs. There was no current man page for
sreformat so I could not add the new switches --omit and --retain.
6. HMMSEARCH has been modified so that it may optionally emit in
fasta format the hits it finds. These may then be fed directly
into HMMALIGN or some separate alignment program without having
to go back and extract each hit from the database.
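For a rough sense of how well the split databases scale, the run times
quoted in item 1 can be turned into speedup and parallel efficiency
figures. The sketch below only does that arithmetic; the function names
are made up for illustration, and the comparison is slightly unfair to
the split runs because the single node runs had the database already in
disk cache.

```python
# Run times in seconds, copied from the timings quoted in item 1:
# (one compute node, split database on 20 compute nodes).
RUNS = {
    "hmmpfam A1HU.pfa vs pfam_fs": (102, 6),
    "hmmsearch Peptidase_M28 vs swissprot": (719, 40),
}
NODES = 20  # compute nodes used for the split runs


def speedup(single, split):
    """Ratio of single node run time to split run time."""
    return single / split


def efficiency(single, split, nodes=NODES):
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup(single, split) / nodes


if __name__ == "__main__":
    for name, (single, split) in RUNS.items():
        print(f"{name}: {speedup(single, split):.1f}x speedup, "
              f"{efficiency(single, split):.0%} of ideal on {NODES} nodes")
```

The third timing (the D. melanogaster translation) has no single node
counterpart in the list, so it is left out.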
See AAAREADME.TXT for complete installation instructions.
In a nutshell:
A. download hmmer 2.3.2
B. unpack the parallelhmmer and copy various files over
those in the 2.3.2 distribution.
C. ./configure --enable-pvm --enable-lfs --prefix=/usr/common
(or as appropriate for your site)
E. make install
F. move the PVM slaves and the extra scripts to their
G. split databases out across PVM nodes (PFAM tools supplied here,
BLAST tools in the parallelblast package).
H. set up PVM, test the PVM programs.
I. set up the *db.txt files that the cgi script needs
for a description of your split databases.
J. customize, install, and test the cgi script.
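To give an idea of what step G involves, here is a minimal sketch of
splitting a FASTA sequence file into one piece per node. This is NOT the
code in the supplied PFAM or parallelblast tools, which should be used
for real installations; the function name and the round robin scheme are
assumptions chosen only for illustration, and the sketch handles FASTA
sequence files only, not HMM libraries.

```python
def split_fasta(text, n_parts):
    """Deal FASTA records out round robin into n_parts pieces.

    Hypothetical helper for illustration only. Returns a list of
    n_parts strings, each holding whole records.
    """
    parts = [[] for _ in range(n_parts)]
    index = -1  # index of the current record; -1 until the first header
    for line in text.splitlines():
        if line.startswith(">"):
            index += 1  # a header line starts a new record
        if index >= 0:  # ignore anything before the first header
            parts[index % n_parts].append(line)
    return ["\n".join(p) + "\n" if p else "" for p in parts]


if __name__ == "__main__":
    demo = ">a\nAAA\n>b\nCC\nCC\n>c\nG\n"
    for node, piece in enumerate(split_fasta(demo, 2)):
        print(f"--- piece for node {node} ---")
        print(piece, end="")
```

Dealing records out this way balances the number of sequences per node,
not the number of residues, so a few very long sequences can still leave
one node with more work than the others.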
Please report bugs, comments, etc.
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech