HMMER w/ GCG db?

David Mathog mathog at seqaxp.bio.caltech.edu
Mon Jun 5 14:41:14 EST 2000


In article <8h8gqh$hv0$1 at news.duke.edu>, "Simon Lin" <simon_lin at med.unc.edu> writes:
>Hello,
>
>Anybody has experience of using GCG databases with HMMER?

I've been working on exactly that recently.  This is for OpenVMS, but I
suspect most of it applies to Unix as well.  It's also still a work in
progress, as I have some bugs left to iron out.

HMMSEARCH will read a protein database with no problems when I specify it
like "sw:swissprot.seq".  However, that doesn't get you anywhere if you
want to use the equivalent of 

   /infile=gb:*
or 
   /infile=@whatever.lis

To deal with that I've used our very hacked up local version of TOFASTA in
a pipe, to translate the DNA to protein and feed it on the fly into
hmmsearch like: 

$ pipe tofasta/infile=gb_pr*/frames=123456/out=sys$output | -
  hmmsearch blah.hmm sys$input > results.out

1.  I had to modify sqio.c in the hmmer package to treat "sys$input" like "-" 
     on Unix.  On Unix, "hmmsearch blah.hmm - > results.out" should work
     without any modifications.

2.  There's some sort of memory problem when feeding hmmsearch a really
    large translated database in this manner.  I only wrote to S. Eddy
    about it today, so I don't know the solution yet; it may or may not
    affect Unix systems, though I suspect it will cause the hmmsearch
    process to use a lot of memory there too.  Since the problem seems to
    depend more on the composition of the data fed in than on the absolute
    size of the database, I suspect it is related to the huge "protein"
    fasta entries which result from translating an entire genbank DNA
    entry in one piece.
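For illustration, the on-the-fly translation step can be sketched in Python.  This is a stand-in for what our hacked TOFASTA does, not its actual code; the "-rev1" naming for reverse frames is my guess at the convention (the post only shows "-for1"):

```python
# Illustrative sketch: translate each DNA entry in all six reading frames
# and emit (FASTA header, protein) pairs, one per frame.

BASES = "TCAG"  # standard genetic-code table ordering
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: aa
               for aa, (a, b, c) in zip(AAS, ((a, b, c) for a in BASES
                                              for b in BASES
                                              for c in BASES))}

def revcomp(dna):
    """Reverse-complement a DNA string."""
    return dna[::-1].translate(str.maketrans("ACGT", "TGCA"))

def six_frames(name, dna):
    """Yield (fasta_header, protein) for all six reading frames."""
    dna = dna.upper()
    for strand, seq in (("for", dna), ("rev", revcomp(dna))):
        for frame in range(3):
            prot = "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                           for i in range(frame, len(seq) - 2, 3))
            yield f">{name}-{strand}{frame + 1}", prot
```

Each translated record could then be written straight to hmmsearch's stdin rather than to a file, which is what the pipe above accomplishes.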

As a workaround for the memory problem, this works:

$ tofasta/infile=gb_pr*/frames=123456/out=sys$output/scommand="@dohmmsearch %s"/split=300/sname="frag"

This chews off 300 entries at a time from the database, translates them in
all six frames, and writes the result to a temporary file.  The procedure
dohmmsearch then picks up the file (the name is substituted for the %s),
feeds it into hmmsearch as a normal file, deletes the temporary file, and
returns.  Tofasta makes the next one, and round and round it goes.
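The split-and-run logic itself is simple; here is a minimal Python sketch (the "run" callback stands in for the dohmmsearch procedure, which is a local script not shown in this post):

```python
# Minimal sketch of the /SPLIT + /SCOMMAND behavior: chew off N entries
# at a time and hand each batch to a callback.

def split_and_run(records, n, run):
    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == n:   # batch full: hand it off
            run(chunk)
            chunk = []
    if chunk:                 # final partial batch
        run(chunk)
```

In practice run() would write the batch to a temporary file, invoke hmmsearch on it, and delete the file, just as dohmmsearch does above.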

This definitely works, but you can't just blindly specify an "E" value,
because E-values are proportional to the size of the database searched.  So
if you change the /SPLIT value, you also have to change "E" in the
dohmmsearch procedure.  (I've asked Sean if there's some way to specify the
score value as a cutoff instead, but have not heard back yet.)
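Concretely, if E scales linearly with the number of sequences searched, the per-fragment cutoff is just the whole-database cutoff scaled by the fragment fraction.  A sketch (the 30,000-entry total below is an invented example figure, not a real database size):

```python
# Scale a whole-database E-value cutoff down to a per-fragment cutoff,
# assuming E is proportional to the number of sequences searched.

def fragment_evalue(e_whole_db, total_entries, fragment_entries):
    return e_whole_db * fragment_entries / total_entries

# e.g. an E of 10 over 30,000 entries scales to 0.1 per 300-entry fragment
print(fragment_evalue(10.0, 30000, 300))
```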

As I said, our tofasta has been modified; here are the added switches:

/FRames=123456          Translate into forward/reverse reading frames,
                        creating new entries named -for1, etc., for each
/SPlit=N                Create a new output file every N input entries
/SName="frag"           Name to give each fragment (required with /SPLIT)
/SCommand="@dothis %s"  Command run on each fragment written (requires /SPLIT)

I suspect I'm going to have to add a switch to cause translated fragments
to automatically break into overlapping pieces below some maximum size.
Not many programs were written to expect a 100,000 aa protein! 
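Such a switch would amount to a sliding-window splitter over each translated entry; a minimal Python sketch (the 2000-residue window, 200-residue overlap, and "_part1" naming are all arbitrary placeholders of mine):

```python
# Hypothetical sketch of the proposed switch: break an over-long
# translated protein into overlapping pieces below a maximum size.

def split_long(name, prot, max_len=2000, overlap=200):
    if len(prot) <= max_len:          # short enough: pass through as-is
        yield name, prot
        return
    step = max_len - overlap          # windows advance by this much
    for part, start in enumerate(range(0, len(prot), step), 1):
        yield f"{name}_part{part}", prot[start:start + max_len]
        if start + max_len >= len(prot):
            break                     # last window reached the end
```

Overlapping the pieces means a domain hit that straddles a cut point still lands whole inside at least one piece, as long as the overlap exceeds the domain length.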

Please email me directly if you have more specific questions.

Regards,

David Mathog
mathog at seqaxp.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech 

More information about the Bio-soft mailing list