EMBL 63 available

Keith Bradnam keith at thale.nott.ac.uk
Sat Jul 8 13:30:17 EST 2000


On 7 Jul 2000, Staffa.Nick wrote:

>  'Twas a real pain for me here.
> Genbank completed overnight, but between my mistakes, embl going down, and
> general slowness of downloads, it was more than 3 days.


It's easy to imagine that in the (near) future we'll all be using Gigabit
(or higher) ethernet/internet connections and will laugh at the day we
were restricted to 100 Mbits per sec...but I'm not sure how much more
(seemingly exponential) sequence database growth will occur before any
major improvements in bandwidth occur.

I was processing EMBL 63 updates today for my Arabidopsis database and
found that there were 50,000 new Arabidopsis EST sequences in one day's
update! This one update represents a third of all the sequences we
currently have and a major jump in database size (especially when you add
on associated information such as blast homologies).

Therefore...

Does anybody know if anyone has looked at developing better compression
tools specifically for EMBL/GenBank records?  I know that GenBank has only
recently moved over to gzip but I feel that there might be something
better that could be developed. This is based on the observation that most
new EMBL sequences appear to be very long and therefore the DNA part of
the sequence entry constitutes the greatest fraction of the total file
size (particularly for the HTG sequences where there is little
annotation).

As a test, I recently took a very long DNA sequence and tried different
compression programs on it and found...

Original sequence                    - 203,335 bases/bytes
pack                                 -  56,298 bytes
compress                             -  57,711 bytes
winzip (maximum compression setting) -  60,427 bytes
gzip (maximum compression setting)   -  60,457 bytes
gzip (default)                       -  61,996 bytes
winzip (default)                     -  63,072 bytes


So in this case, an older UNIX compression tool ('pack') beats the rest by
a small margin.  However, I wrote a dead simple script to knock the
sequence down further to 50,834 bytes, i.e. 75% compression, which is
easily possible if you encode each of the 4 DNA characters in 2 bits, so
that 4 bases fit into one 8-bit byte.
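
A minimal sketch of this kind of packing, in Python, for anyone who wants
to try it. It is not the script mentioned above, just an illustration: it
assumes the sequence contains only A, C, G and T, the names pack_dna and
unpack_dna are made up, and the original length has to be stored
separately so the padding in the last byte can be dropped.

```python
# Illustrative sketch only: assumes the input holds nothing but A, C, G, T.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_dna(seq):
    """Pack a DNA string into bytes, 4 bases per byte (2 bits per base)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk; the unused low bits are padding.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_dna(data, length):
    """Reverse pack_dna; 'length' is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    # Drop the padding bases decoded from the last byte.
    return "".join(bases[:length])
```

A 203,335-base sequence packs into 50,834 bytes this way (203,335 / 4,
rounded up), which matches the figure above.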

Of course this doesn't work for protein sequences, or where there are N's
in the DNA sequence, or for all the ancillary information in an EMBL
record, and it's only a small saving.  But apply that saving in
compression to an entire database and you might save a few hours
downloading time.

Does anybody know of any research being done on this?  I know somebody at
Nottingham University who is kind of interested, but wants to know if
there would be interest in such a compression program...and for that to
happen I guess it would have to be accepted as a standard by all major
databases and made very easy to get hold of.

I can't help feeling that extra compression could be gained by further
considering some of the more frequent hexanucleotides that occur in DNA
sequences.  
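
One rough way to check whether hexanucleotide frequencies would help is to
estimate how many bits per base an ideal coder would need if it assigned
codes to hexamers according to how often they occur. The Python sketch
below does that under some loud assumptions: it only looks at
non-overlapping hexamers, it ignores the cost of storing the frequency
table, and the function name hexamer_bits_per_base is made up. Anything
below 2.0 bits per base would suggest some room beyond plain 2-bit
packing.

```python
import math
from collections import Counter


def hexamer_bits_per_base(seq):
    """Estimate bits per base if non-overlapping hexamers were coded
    at their empirical (zero-order) entropy. Illustrative only."""
    # Split into consecutive 6-base words; a trailing partial word is ignored.
    hexamers = [seq[i:i + 6] for i in range(0, len(seq) - 5, 6)]
    if not hexamers:
        raise ValueError("sequence is shorter than one hexamer")
    counts = Counter(hexamers)
    total = len(hexamers)
    # Shannon entropy of the hexamer distribution, in bits per hexamer.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return entropy / 6
```

This is only an estimate from one sequence's own statistics, so a real
tool would need a table built from, and shared across, many sequences.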

Anyone have any info/thoughts/views???

Keith

P.S. I accept that in some ways this is arguing about something that might
be blown out of the water by any new developments in bandwidth...but maybe
there is some mileage in this.

~  Keith Bradnam - Developer, Arabidopsis Genome Resource (AGR)
~  Nottingham Arabidopsis Stock Centre - http://nasc.nott.ac.uk/
~  University Park, University of Nottingham, NG7 2RD, UK
~  Tel: (0115) 951 3091 








