Death of the GenBank floppy format?

James P. H. Fuller jim at crom2.rn.com
Sat Aug 3 10:56:15 EST 1991


kristoff at genbank.bio.net (David Kristofferson) writes:

> Yes, your "translation" is correct.  The April 1992 release (71) is
> the LAST release of the floppy disk format files, compressed format
> files, binary files, or whatever other name you wish to call them by.
> CDROMs for releases 72 in June and 73 in September will contain the
> latest tape format (i.e., ASCII) flat file release, a repeat of floppy
> format release 71, and the latest available GenPept data in ASCII tape
> format.
>
> This decision was made at the June joint GenBank/NCBI advisors meeting
> after first consulting NIGMS, NCBI, and the one commercial developer
> that uses the format.  There is no doubt that some people will find
> this change disconcerting, as always happens with any change, but the
> time was ripe to "bite the bullet on this" as advancing technology is
> making continued support of this format far less attractive than other
> options.

     Thanks very much for your reply.  I don't find the loss of any given
file format "disconcerting" but I do see a need for *some* compressed
version of GenBank -- and the other large biological databases too, for
that matter.  Few people have hard disks of infinite extent (on their
*personal* systems, anyway ;-)



> NCBI, the National Center for Biotechnology Information at the
> National Library of Medicine in Bethesda (and the party responsible
> for the future of GenBank from October 1992 on), has held developers'
> meetings over the last year or so and now has a CDROM in beta release
> as was announced by Dennis Benson recently on the BIO-SOFTWARE/
> bionet.software newsgroup.

     Yes, I remember his announcement of the NCBI "Entrez: Sequences"
CD-ROM not long ago but nothing in the announcement gave any hint that
he was talking about the intended *replacement* of the current GenBank
distribution(s).  After all, he said that a general distribution of
"Entrez: Sequences" is planned for the fall of *this* year, while the
current GenBank contract doesn't expire until the fall of *next* year
and I presume that IntelliGenetics will continue their usual GenBank
distribution until then.  I expect Dennis Benson would have startled a
few more people like me out of somnolence had he said "This is the wave
of the future, and it's about to roll over YOU...."



> If these issues concern you, you should ***MAKE SURE THAT YOU STAY
> INFORMED*** about developments at NCBI.  I also invite NCBI to utilize
> the BIOSCI newsgroups to elaborate further on their plans.  Although
> NCBI has maintained a mailing list, bits at bio.nlm.nih.gov, fewer than 40
> messages have been posted to that forum in the last two years.

     The "Entrez: Sequences" announcement noted that
     
> A mailing list is now being assembled for individuals who would
> be interested in participating in the CD-ROM evaluation or who
> would like to stay informed of the availability of subscriptions
> to NCBI CD-ROMs.

but gave no hint as to how to get on this mailing list except by
requesting an evaluation copy of the CD-ROM, which it also said was in extremely
short supply.  For those like Dr. Steffen who expressed an interest in
what NCBI may be plotting, the announcement did give an address for "any
further questions" -- info at ncbi.nlm.nih.gov.  I have written to this
address to ask how to get on the mailing list and will post any reply
I receive.  In the meantime, concerning the "bits at ..." list you mention:
is there a corresponding "bits-request at ..." address to which one should
send "please subscribe" messages?



> BIO-SOFTWARE/bionet.software is a widely read international forum and
> would be a more effective vehicle for communication.  I am sure,
> however, that despite all good efforts at public education there will
> still be many who are caught completely unawares by impending changes
> 8-(.  Hopefully the more messages that are sent out, the smaller the
> number of "surprises" will be.

     Amen!
     
     

> The "compressed" files are NOT ZIPped or "compressed" by any widely
> available compression utility.  There is no possibility of "multiple
> compressions" being produced by utility programs.  That is why I use
> the term "floppy format" files instead of "compressed" format because
> this is a special binary format created by software at GenBank.  This
> software was inherited from the first contractor in 1987 and has been
> patched and revised many times as the database grew and "broke" the
> code.  The code is not worth maintaining and should probably be
> completely rewritten *if* resources were to be devoted to this.
> However, NCBI has other format plans which make far more sense than a
> continuation of this obsolete format. 

     I didn't mean to imply that the files had simply been sat on by
somebody's Huffman or Lempel-Ziv elephant.  A number of different
techniques are used to arrive at the floppy-format files, including some
compression and some elimination of data, and these (or rather their
inverse) are well demonstrated and commented in GBTAPE.PAS.  I agree
that you could certainly do better.  The FLOPPY67\DATA directory on the
release 67 CD-ROM is about one third the size of the GBTAPE67 directory, and
this is achieved in part through loss of information.  Out of curiosity
I ZIPped GBPHG.SEQ; the output was a file 0.311 times the size of the
original, which is a slightly better reduction in size with NO loss of data.
My point about multiple versions is that users who need SOME
Honey-I-Shrunk-The-Data version of these giant files will first look to the
distributing authority and if none is provided they will make their own.  And
some will use ZIP, and some will use LHARC, and some PAK or ARC or ZOO
or Unix pack or compress -- hence multiple versions and lots of amusing
chaos.  It would be *very* nice to know what NCBI has in mind; but
judging by the "Entrez: Sequences" announcement they're thinking bigger,
not smaller.
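
     For anyone who wants to repeat the ZIP experiment on their own files,
here is a minimal sketch of the size comparison.  It assumes Python and its
standard zlib module (DEFLATE, the method most ZIP tools use), not the
archiver I actually ran.  The names deflate_ratio and file_ratio are
hypothetical, and the sample bytes are a made-up stand-in rather than real
GenBank data, so the figure it prints will not match the 0.311 above.

```python
import zlib


def deflate_ratio(data: bytes, level: int = 9) -> float:
    """Return compressed size / original size for lossless DEFLATE."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    packed = zlib.compress(data, level)
    # Round-trip check: every byte comes back, unlike a lossy format.
    if zlib.decompress(packed) != data:
        raise RuntimeError("round trip failed")
    return len(packed) / len(data)


def file_ratio(path: str) -> float:
    """Same ratio for a file on disk, read whole (fine for modest files)."""
    with open(path, "rb") as fh:
        return deflate_ratio(fh.read())


# Made-up, highly repetitive stand-in for a flat sequence file.
sample = b"ORIGIN\n" + b"gatcctccat atacaacggt atctccacct caggtttaga\n" * 2000
print(f"compressed/original = {deflate_ratio(sample):.3f}")
```

To try it on a real file, call file_ratio with the path to, say, a copy of
GBPHG.SEQ.  Real sequence data is far less repetitive than the stand-in, so
expect a much larger ratio than the sample prints.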

     Thanks again.
                                                 -- Jim
                                                 
 
crom2 Athens GA Public Access Unix   |  i486 AT, 16mb RAM, 600mb online
   Molecular Biology                 |  AT&T Unix System V release 3.2
   Population Biology                |  Tbit PEP 19200bps  V.32  V.42/V.42bis
   Ecological Modeling               |    admin: James P. H. Fuller
   Bionet/Usenet/cnews/nn            |    {jim,root}@crom2.rn.com



