EMBOSS indexing of GCG genpept

David Mathog mathog at caltech.edu
Thu Oct 18 05:23:15 EST 2001


Has anybody successfully used dbifasta to index the GCG-supplied genpept
database directly?

The first problem was that the header lines looked like:

>BAA35036  GB:AB001396 E2 region [Hepatitis C virus] (ver 1)

and dbifasta listed no matching format.  It turned out there was a
gcgaccid format in the dbifasta code; it just wasn't in the acd file.  So
I added it to the acd file, and it showed up in the menu, but it wouldn't
run (it would just start and stop with no warning or error messages).

Having faced this sort of header problem about a million times before, I
used fastamungheader
( ftp://saf.bio.caltech.edu/pub/software/molbio/fastamungheader.c )
to rewrite genpept.seq into a supported header format, with all the lines
having exactly the same length:

>GB:AB001396 BAA35036  E2 region [Hepatitis C virus] (ver 1).

and used the gcgidacc switch.  That ran to completion, generating along
the way around a zillion lines like:

  This is a warning:  Duplicate ID skipped: Z99759
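
For illustration only, the header rewrite amounts to swapping the first
two fields, and the duplicate warnings can be anticipated by counting how
often each new ID occurs.  The Python sketch below is not fastamungheader:
it does not pad lines to equal length, and the function names are made up
for this example.  One possible cause of the duplicates (an assumption,
not something confirmed here) is that several protein entries carry the
same GB: accession, which becomes the ID after the swap.

```python
from collections import Counter


def swap_id_and_accession(header):
    """Turn '>FIRST SECOND description' into '>SECOND FIRST description'."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: %r" % header)
    fields = header[1:].split(None, 2)
    if len(fields) < 2:
        return header  # nothing to swap, leave the line untouched
    rest = fields[2] if len(fields) > 2 else ""
    return (">%s %s %s" % (fields[1], fields[0], rest)).rstrip()


def duplicate_ids(headers):
    """Return {id: count} for first fields that occur more than once."""
    counts = Counter()
    for line in headers:
        fields = line[1:].split(None, 1) if line.startswith(">") else []
        if fields:
            counts[fields[0]] += 1
    return {key: n for key, n in counts.items() if n > 1}


if __name__ == "__main__":
    original = ">BAA35036  GB:AB001396 E2 region [Hepatitis C virus] (ver 1)"
    print(swap_id_and_accession(original))
```

Running duplicate_ids() over the rewritten header lines before indexing
would show in advance which IDs dbifasta is likely to skip.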

Then I put back the original genpept.seq file, not knowing what the GCG
software might or might not be expecting in the FASTA header.

After that sequences could be retrieved with:

# seqret -sequence genpept:BAA35036 -filter
>BAA35036 GB:AB001396 E2 region [Hepatitis C virus] (ver 1)
RTNVMGGAAAITTRGFVSLFTLINSQR

but not

# seqret -sequence genpept:AB001396
Reads and writes (returns) sequences
^C  (after giving up waiting for a prompt to reappear,  versus)
# seqret -sequence genpept:wombat
Reads and writes (returns) sequences
   An error has been found: EMBLCD Entry failed
   An error has been found: Database 'genpept' : access method 'emblcd' failed
   An error has been found: option -sequence: Unable to read sequence 'genpept:wombat'
   There is a serious problem: seqret terminated: Bad value for option and no prompt

That is, specify nonsense and it blows up instantly, which is fine.
Specify a GenBank ID number and it hangs.  I can just picture somebody
using w2h and specifying a genpept:ID combo and locking the server until
the cows come home.  To prevent that I'm deleting the indices for now.
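
A less drastic stopgap might be to run lookups under a time limit, so a
hung query gets killed instead of tying up the machine.  The following is
a minimal Python sketch of that idea, not anything EMBOSS or w2h provides;
run_with_timeout is a hypothetical helper and the 30-second limit in the
usage comment is an arbitrary choice.

```python
import subprocess


def run_with_timeout(cmd, seconds):
    """Run cmd (a list of strings) with a time limit.

    Returns (returncode, output) if the command finishes in time,
    or None if it had to be killed after `seconds` seconds.
    """
    try:
        done = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,      # never wait on an interactive prompt
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,      # keep error messages with the output
            universal_newlines=True,
            timeout=seconds,               # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    return done.returncode, done.stdout


# Example use (requires seqret on the PATH):
#   result = run_with_timeout(
#       ["seqret", "-sequence", "genpept:AB001396", "-filter"], 30)
#   if result is None:
#       print("lookup timed out")
```

This only limits the damage from a hung lookup; it does not explain why
the lookup hangs in the first place.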

The /usr/local/share/EMBOSS/emboss.default file has this for genpept:

DB genpept [
  method: gcg
  format: fasta
  dir: $emboss_db_dir/gcggenpept
  file: *.seq
# optional parameters
  type: P
  release: 122.0
  indexdir: $emboss_index_dir/gcggenpept
]

What are you folks doing with genpept?  Right now it looks like the only
safe thing to do is to index on the one field and ignore the other.

Thanks,

David Mathog
mathog at caltech.edu




