Database file format conversion

Don Gilbert gilbertd at sunflower.bio.indiana.edu
Mon Sep 30 09:26:14 EST 1991


In article <9109301300.AA05395 at lux.think.com> jones at Think.COM (Robert Jones) writes:

>I was thinking about setting up an equivalent for sequence files and
...
>something new - but I do like the PBM approach. I'd be willing to put in a
>share of the work on something like this.

You may want to look at readseq (anon. ftp to ftp.bio.indiana.edu,
cd molbio/readseq), which converts user files among about 13 sequence
file formats.  As a program, however, readseq does not do the job for
large data banks (GenBank et al.), because it translates from one
_file_ to another and is not optimized for large files.  What we
don't need, and what I think Bill Pearson was getting at, is multiple
copies of 100+ megabyte databases.

What we do need is sequence analysis software that can recognize and
read data from databanks that are in one of several possible formats.
One way to do that would be for software developers to write several
format reading routines, such as Bill Pearson has done in FASTA.  
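
A minimal sketch of that kind of format recognition, in Python.  The
function name guess_format and the labels it returns are made up for
illustration; this is not how FASTA or readseq does it.  It assumes an
entry can be recognized from its first non-blank line: "LOCUS" for
GenBank, "ID   " for EMBL, ">P1;"-style headers for NBRF, and a plain
">" for Pearson/FASTA format.  Real databank files may carry a release
header before the first entry, which this sketch does not skip.

```python
def guess_format(lines):
    """Guess a sequence format from the first non-blank line.

    `lines` is any iterable of text lines (an open file works).
    Returns one of "genbank", "embl", "nbrf", "fasta", "unknown".
    """
    for line in lines:
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("ID   "):
            return "embl"
        # NBRF headers look like ">P1;NAME": two-letter type code, then ";"
        if line.startswith(">") and len(line) > 3 and line[3] == ";":
            return "nbrf"
        if line.startswith(">"):
            return "fasta"
        return "unknown"  # first real line matched nothing we know
    return "unknown"      # empty input
```

An analysis program could call this once per databank file and then
hand the file to the matching reader, rather than asking the user to
convert the databank first.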

Another option is for someone to write general-purpose routines (as I
tried to do with Readseq) that other developers could incorporate into
their software as needed.  However, reading large databases requires
functions that are well tuned to the parent program, so that the huge
amount of disk i/o time is minimized.  Writing software to speed up
database reading for your particular package is time-consuming, and
since many software companies provide a database service with
correctly formatted files for their software, they have a difficult
time justifying the extra development effort to read many formats.
This is why, I think, most sequence analysis software recognizes only
one format, either GenBank or EMBL or NBRF ...
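
As a rough illustration of reading a databank in place instead of
converting it to a second file, here is a Python sketch of a
GenBank-style reader that makes a single sequential pass and yields one
entry at a time.  The name iter_genbank is hypothetical, and the sketch
assumes the usual flat-file layout: an entry starts at "LOCUS", the
sequence follows "ORIGIN", and "//" ends the entry.  It is not tuned
for speed; a real program would tune its reader to its own needs.

```python
def iter_genbank(handle):
    """Yield (name, sequence) pairs from GenBank-style text, one entry
    at a time, without building a converted copy of the databank."""
    name, in_seq, chunks = None, False, []
    for line in handle:
        if line.startswith("LOCUS"):
            parts = line.split()
            name = parts[1] if len(parts) > 1 else ""
            in_seq, chunks = False, []
        elif line.startswith("ORIGIN"):
            in_seq = True            # sequence lines follow
        elif line.startswith("//"):
            if name is not None:
                yield name, "".join(chunks)
            name, in_seq, chunks = None, False, []
        elif in_seq:
            # lines look like "        1 gatcctccat ..."; keep letters only
            chunks.append("".join(ch for ch in line if ch.isalpha()))
```

Passing an open file as handle keeps only the current entry in memory,
so the databank is read once from disk and never duplicated.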
 
Some people at NCBI, who are taking over some of the management of
GenBank, have yet another format to which new versions of GenBank will
be moved.  I think they hope that this will become _the_ format, but I
suspect that molbio software companies will do just as they do now --
write databank conversion routines, then distribute their software,
which reads only their favorite databank format, along with
reformatted databanks (at a charge to customers).  More info on the
new NCBI format, called ASN.1, can be found by anonymous ftp to
ncbi.nlm.nih.gov.  There have been announcements here in the past few
months about beta test releases of a compact disc with GenBank/Protein
and Medline databanks in ASN.1 format, including browsing software.
Expect the full release of this CD in the near future (no, I don't
know when).

-- Don

-- 
Don Gilbert                                     gilbert at bio.indiana.edu
biocomputing office, biology dept., indiana univ., bloomington, in 47405



