ANNOUNCE: SEQIO - A C Package for Reading and Writing Sequences

James Knight knight at quad.cs.ucdavis.edu
Wed Feb 7 22:09:24 EST 1996


This is the initial release of the SEQIO package, a set of C functions
which can read and write biological sequence files in a variety of
file formats and which can be used to perform efficient searches of
biological databases.  It's essentially a successor to the "readseq"
program, but geared more toward being used inside other programs than
just as a file conversion program (although it can do that too).

The package currently supports the following file formats: GenBank
Flat File, PIR/CODATA, EMBL/Swiss-Prot, FASTA, NBRF, IG/Stanford and
ASN.1 text files.  More formats will be included as I find out the
details about them.

The package is freely available to anyone and can be ftp'ed from the
following FTP site:

   ftp://ftp.cs.ucdavis.edu/pub/strings/seqio.tar.gz

It is a gzip'ed tar file containing the package code and
documentation files.  I don't have a Web site up yet, but it's coming
soon.


What I'm looking for now are four things:

    1) Users to begin writing programs with the package (see below
       for an example program).
    2) People who have examples and/or descriptions of other file
       formats so I can include them (it takes me on average about
       an hour per file format).  High on my list of formats to
       include are the Phylip formats, FASTA/BLAST output and any
       multiple sequence alignment formats.  A more complete list
       is given in the documentation.
    3) Information about the organization and file formats used
       by any databases out there (if you look at the documentation
       to the package, you'll see what I mean).
    4) Folks who are interested enough in getting the package to
       run on their machine that they would help me port it.  It
       is currently Unix-specific software and has been tested under
       SunOS, Ultrix and IRIX, because those are the only systems
       I have access to.  I'm willing to do as much as I can to
       get it to work on any and all machines.


The main goal of the package was to make reading and writing sequences
as easy as reading and writing normal files, while still being able to
handle large databases like GenBank.  As an example, this complete
program takes a keyword and a database name, checks all of the sequences
in the database and outputs the entries whose sequences contain the
keyword:

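#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* for strstr */
#include "seqio.h"

int main(int argc, char *argv[])
{
  int len;
  char *seq, *entry;
  SEQFILE *sfp;

  if (argc != 3) {
    fprintf(stderr, "match keyword database\n");
    exit(1);
  }

  /* Open the database named on the command line. */
  if ((sfp = seqfopendb(argv[2])) == NULL)
    exit(1);

  /* Step through every sequence in the database. */
  while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
    /* strstr is case-sensitive, so the keyword must match exactly. */
    if (len > 0 && strstr(seq, argv[1])) {
      /* Print the whole entry that the matching sequence came from. */
      entry = seqfentry(sfp, NULL, 0);
      if (entry != NULL)
        fputs(entry, stdout);
    }
  }

  seqfclose(sfp);
  return 0;
}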

This program scanned all of the GenBank database, Flat File Release
87.0 (about 800MB of text, of which 249MB is sequence), for a randomly
generated 20-character sequence in under 8 minutes on a DEC 5000/240
(not an Alpha).







