Splitting EMBL cumulative update file

Peter Stoehr stoehr at ebi.ac.uk
Tue Jul 20 04:10:46 EST 1999

Several sites have reported problems handling our EMBL cumulative update file
when it grows beyond 2GB (uncompressed).
The problems arise either at the operating system level, where the filesystem
does not support files >2GB (sigh...), or in SRS5, which does not seem to like
indexing a file >2GB or so. Perhaps other application software has problems too.
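
For sites that want to check whether a given file will run into this, here is
a minimal Python sketch. It assumes the limit in question is the usual signed
32-bit file offset (2**31 - 1 bytes); the function names are made up for
illustration. The file is streamed rather than trusting the size stored in the
gzip trailer, since that field only holds the length modulo 2**32.

```python
import gzip

# Assumption: ">2GB" means exceeding a signed 32-bit file offset.
LIMIT_2GB = 2**31 - 1


def uncompressed_size(path, chunk_size=1 << 20):
    """Return the uncompressed size in bytes of a gzip file.

    The data is read in chunks and discarded, so memory use stays small
    even for multi-gigabyte files.
    """
    total = 0
    with gzip.open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def exceeds_2gb(size_in_bytes):
    """True if a file of this size would not fit under the assumed limit."""
    return size_in_bytes > LIMIT_2GB
```

For example, exceeds_2gb(uncompressed_size("cumulative.dat.gz")) tells you
whether the uncompressed file would cross the limit before you unpack it.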

In *addition* to the cumulative.dat.gz files we make available on our FTP
server, we now also split that dataset into smaller parts, namely:

cum_1.dat.gz           !everything in the cumulative file except the data in the files below
cum_est1.dat.gz        !ESTs
cum_htg1.dat.gz        !HTG
cum_hum1.dat.gz        !human (non-gss, non-HTG, non-EST)
cum_gss1.dat.gz        !GSS
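
To illustrate the division above, here is a rough Python sketch of routing
flat-file entries to these file names. It is not the procedure we run at EBI.
It assumes each entry ends with a "//" line and that the ID line has the form
"ID   name  dataclass; molecule; division; length BP.", with the division
code (EST, HTG, HUM, GSS, ...) as the third semicolon-separated field. The
mapping table and function names are hypothetical.

```python
# Assumed mapping from division code to output file (uncompressed names).
# Only the file names themselves come from the list above.
DIVISION_TO_FILE = {
    "EST": "cum_est1.dat",
    "HTG": "cum_htg1.dat",
    "HUM": "cum_hum1.dat",
    "GSS": "cum_gss1.dat",
}
DEFAULT_FILE = "cum_1.dat"  # everything not covered by the table


def read_entries(lines):
    """Yield each entry as a list of lines; an entry ends at a '//' line."""
    entry = []
    for line in lines:
        entry.append(line)
        if line.strip() == "//":
            yield entry
            entry = []
    if entry:  # trailing lines without a terminator
        yield entry


def division_of(entry):
    """Return the division code from the ID line, or None if not found."""
    for line in entry:
        if line.startswith("ID "):
            fields = [field.strip() for field in line[2:].split(";")]
            return fields[2] if len(fields) >= 3 else None
    return None


def route(lines):
    """Yield (file_name, entry) pairs for every entry in the input."""
    for entry in read_entries(lines):
        yield DIVISION_TO_FILE.get(division_of(entry), DEFAULT_FILE), entry
```

Because each entry carries a single division code, an entry in the EST, HTG
or GSS division goes to the matching file whatever its organism, which is
consistent with cum_hum1 holding only the non-GSS, non-HTG, non-EST human data.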

We chose this data division:
- because it represents the current major division of data in the updates, or
  our best guess about what will happen in the coming months.
- to try to keep each file below 2GB for as long as possible, i.e. before
  needing cum_hum2 etc.
- to avoid frequent changes. These filenames should last a while and
  contain some data even soon after a new full release. SRS icarus files (for
  example) should not need constant changing, and when they do we should be able
  to see it coming well beforehand.

Those who 'mirror' our ftp.ebi.ac.uk/pub/databases/embl/new directory may wish
to have either the full cumulative.dat.gz or the cum_* set ignored, in order
to save some network bandwidth. The data contained is the same - we build the cum_*
files from the cumulative.dat.gz file afresh each day. There is no
correspondence between, say, cum_est1.dat and the est1.dat file of the full

FYI the EBI SRS server now has the EMBLNEW dataset using these cum_* files.
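
For those mirroring the directory, the choice between the full file and the
cum_* set described above could be expressed as a simple name filter. This is
only a sketch in Python: the patterns are inferred from the file names in this
message, and the function is hypothetical rather than part of any mirroring
tool.

```python
from fnmatch import fnmatchcase

# Patterns inferred from the file names in this announcement.
SKIP_PATTERNS = {
    "full": ["cumulative.dat.gz"],   # skip the single large file
    "split": ["cum_*.dat.gz"],       # skip the split set
}


def files_to_fetch(names, skip):
    """Return the names to download, leaving out the chosen set.

    skip is "full" or "split"; all other files are passed through untouched.
    """
    patterns = SKIP_PATTERNS[skip]
    return [
        name for name in names
        if not any(fnmatchcase(name, pattern) for pattern in patterns)
    ]
```

Note that "cum_*.dat.gz" does not match cumulative.dat.gz, since that name has
no underscore after "cum", so the two sets can be excluded independently.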

Peter Stoehr

More information about the Embl-db mailing list