ANNOUNCE: SEQIO, version 1.2 is now available

James Knight knight at cs.ucdavis.edu
Fri Jul 5 00:45:32 EST 1996


This message is to announce the release of version 1.2 of the SEQIO
package.  The package is freely available to anyone for commercial or
non-commercial use, and can be ftp'ed from the following FTP site:

   ftp://ftp.cs.ucdavis.edu/pub/strings/seqio.tar.gz

It is a gzip'ed tar file (356K compressed) containing the package
code and documentation files. I've also set up a web site for the
package at

   http://wwwcsif.cs.ucdavis.edu/~knight/seqio.html

Also, see the description below for more information about the
package.

Major changes from version 1.1:

  * Added the GCG, MSF and BLAST program output formats (including
    the ability to convert between non-GCG and GCG forms of GenBank,
    PIR, EMBL, Swiss-Prot, FASTA, NBRF and IG/Stanford formats
    without losing any header information).
  * Added the ability to index database entries based on any (or
    all) identifiers given in the entry, including the new NID and
    PID numbers.
  * Added the ability to handle database identifiers (with wildcards)
    and randomly access the specified entries.
  * Added a "single entry access specification" mode for regular
    files, so that you can extract, say, just the third entry in a 
    file, or the entry whose identifier is "sp:104k_thepa".
  * Added a number of example programs to show how to use the package.
  * The file conversion program (fmtseq) can now do "big alignments"
    of BLAST output, and can do conversions between GCG and non-GCG
    forms of sequence entries without losing any header information.



For those of you who were at ISMB'96 and to whom I promised that both
this package and my new database search algorithm would be available
last weekend, the database search algorithm isn't ready yet.  (I still
haven't gotten the hang of the difference between estimates for "down
the hallway" software, i.e. software written for the folks down the
hallway, and real product-quality software.)  However, it will be
ready soon, certainly by the time my post-doc runs out at the end of
July.

For those of you who weren't at ISMB (or whom I didn't tell about my
database search algorithm), I've developed an alternative to FASTA and
BLAST that should produce Smith-Waterman quality alignments, i.e. the
same alignments you'd get if you ran the full-blown Smith-Waterman
search, but with the speed of FASTA and BLASTP.  (It will probably
take me until the next version of this program to get to BLASTN
speeds.)


Which reminds me.  My post-doc here at UC Davis ends at the end of
July, and so I'm looking for a job (either an industry
algorithm/software-development position or a post-doc working with
biologists).  If you have such a position or know about such a
position (that hasn't been widely advertised on the newsgroups or on
the WWW, because I've seen those), please let me know.  I would
appreciate it.

Jim

*************************
*************************


     SEQIO:  A C/C++ Package for Reading and Writing Sequences


The SEQIO package is a C/C++ package (or library) which makes reading
and writing sequences and biological databases as easy as reading and
writing files, while at the same time supporting I/O in the following
file formats:

    Raw/Plain, GenBank, PIR (CODATA), EMBL, Swiss-Prot, FASTA, NBRF,
    IG/Stanford, ASN.1 text files, GCG, MSF, PHYLIP, ClustalW, and
    output from the FASTA and BLAST suites of programs

supporting completely configurable databases, using the new BIOSEQ
standard for describing databases, like this one for GenBank:

   #
   # The GenBank Flat-File Database
   #
   >GenBank,gb:  /databases/genbank
   >Name:  GenBank
   >IdPrefix:  gb
   >Index: gbindex
   >Format:  gbfast
   >Alphabet:  DNA
       #
       # GenBank files as found at ftp site ncbi.nlm.nih.gov in /genbank.
       # 
       gbbct.seq, gbest?.seq, gbinv.seq, gbmam.seq, gbpat.seq, gbphg.seq
       gbpln.seq, gbpri.seq, gbrna.seq, gbrod.seq, gbsts.seq, gbsyn.seq
       gbuna.seq, gbvrl.seq, gbvrt.seq

       daily-nc/nc????.flat,  daily:(daily-nc/nc????.flat)

and supporting the transparent specification (as far as the program is
concerned) of single entries of databases, like "gb:humhb*" for all of
the human beta globin GenBank entries, and of single entries of any
files, like "myseqs@3,4" or "myseqs@gb:humhba1" to specify either the
third or fourth entry of file "myseqs" or the entry in "myseqs" whose
identifier is the GenBank HUMHBA1 locus.  Also, the database entries
can be specified using the database-specific identifiers (e.g.,
GenBank locus names, PIR entry names, ...) or using the
cross-database accession, NID or PID numbers.  Or all three, if you
want.
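
To make the entry specifications concrete, here is a minimal sketch in C
of how such a string could be taken apart and matched.  It is not SEQIO
code: the function names split_entry_spec and wildcard_match are
hypothetical, and it assumes that the file name and the entry
specification are joined by a single "@", that "*" matches any run of
characters, that "?" matches one character, and that identifiers compare
without regard to case.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of SEQIO.
 * Split "file@entry" at the first '@' (assumed separator).
 * The file part is copied into file_buf (truncated if too long);
 * the return value points at the entry part inside spec, or is NULL
 * when spec contains no '@'. */
const char *split_entry_spec(const char *spec, char *file_buf, size_t buf_size)
{
    assert(spec != NULL && file_buf != NULL && buf_size > 0);

    const char *at = strchr(spec, '@');
    size_t file_len = at ? (size_t)(at - spec) : strlen(spec);

    if (file_len >= buf_size)
        file_len = buf_size - 1;          /* truncate rather than overflow */
    memcpy(file_buf, spec, file_len);
    file_buf[file_len] = '\0';

    return at ? at + 1 : NULL;
}

/* Hypothetical helper, not part of SEQIO.
 * Case-insensitive match of text against pattern, where '*' stands for
 * any run of characters (including none) and '?' for one character.
 * Returns 1 on a match, 0 otherwise. */
int wildcard_match(const char *pattern, const char *text)
{
    const char *star = NULL;    /* last '*' seen in pattern */
    const char *retry = NULL;   /* where in text that '*' started matching */

    while (*text != '\0') {
        if (*pattern == '*') {
            star = pattern++;
            retry = text;
        } else if (*pattern == '?' ||
                   tolower((unsigned char)*pattern) ==
                   tolower((unsigned char)*text)) {
            pattern++;
            text++;
        } else if (star != NULL) {
            /* Backtrack: let the last '*' absorb one more character. */
            pattern = star + 1;
            text = ++retry;
        } else {
            return 0;
        }
    }
    while (*pattern == '*')     /* trailing stars match the empty string */
        pattern++;
    return *pattern == '\0';
}
```

With these helpers, split_entry_spec("myseqs@gb:humhba1", buf, sizeof buf)
leaves "myseqs" in buf and returns "gb:humhba1", and
wildcard_match("gb:humhb*", "gb:HUMHBA1") returns 1.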

In addition, the distribution comes with a reimplementation and
extension of Don Gilbert's readseq program (called fmtseq).  Besides
a much better user interface, this program can perform "no loss"
conversions between the non-GCG and GCG forms of GenBank, PIR, EMBL,
Swiss-Prot, FASTA, NBRF and IG/Stanford entries, and can take the
output from one of the FASTA or BLAST programs and construct a "big
alignment" by lining up all of the pairwise alignments into a
multiple alignment.
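
As a rough illustration of the "big alignment" idea, the sketch below
lays one pairwise alignment out in the coordinates of the query, so that
rows built from different hits line up column by column.  This is not
how fmtseq is implemented; the struct layout and the name
project_onto_query are hypothetical, and the sketch simplifies by
dropping subject residues that align to gaps in the query (a full
version would have to widen those columns instead).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One pairwise alignment taken from a search report (hypothetical layout). */
struct pairwise {
    int query_start;          /* 0-based query position where the alignment begins */
    const char *query_row;    /* aligned query residues, '-' marks a gap */
    const char *subject_row;  /* aligned subject residues, same length as query_row */
};

/* Write the subject residues into row, one column per query position.
 * row must have room for query_len + 1 bytes.  Positions not covered by
 * the alignment, and subject gaps, are written as '-'.
 * Returns 0 on success, or -1 if the two aligned rows differ in length
 * or the alignment runs outside the query (row is then incomplete). */
int project_onto_query(const struct pairwise *aln, int query_len, char *row)
{
    assert(aln != NULL && row != NULL && query_len >= 0);

    if (strlen(aln->query_row) != strlen(aln->subject_row))
        return -1;

    memset(row, '-', (size_t)query_len);
    row[query_len] = '\0';

    int qpos = aln->query_start;
    for (size_t i = 0; aln->query_row[i] != '\0'; i++) {
        if (aln->query_row[i] == '-')
            continue;         /* insertion relative to the query: dropped */
        if (qpos < 0 || qpos >= query_len)
            return -1;
        row[qpos++] = aln->subject_row[i];
    }
    return 0;
}
```

For a query of length 8 and an alignment that starts at query position 2
with query row "GT-AC" and subject row "GTTAG", the resulting row is
"--GTAG--".  Calling the function once per hit and printing the rows
under the query gives a simple stacked multiple alignment.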

There's some other stuff too, but really, don't you think that's enough? 




