Splitting EMBL cumulative update file
stoehr at ebi.ac.uk
Mon Jul 19 09:48:24 EST 1999
Several sites have reported problems handling our EMBL cumulative update file
when it grows beyond 2GB (uncompressed).
The problems are either at the operating system level, where the filesystem does
not support >2GB files (sigh...), or in SRS5, which does not seem to like indexing
a file >2GB or so. Perhaps other application software has problems also.
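
For sites that want to see in advance whether a given file will hit this limit, something like the following Python sketch could be used. It is only an illustration, not part of our production scripts: the function names and the chunk size are made up for this example, and it assumes the "2GB" limit is the usual signed 32-bit offset limit of 2^31 - 1 bytes. It counts the uncompressed bytes by streaming through the gzip file, since the size stored in the gzip trailer is only recorded modulo 2^32.

```python
import gzip

# Assumption: the "2GB" limit is the largest offset a signed 32-bit
# integer can hold, which is what 32-bit file APIs can address.
LIMIT_32BIT = 2**31 - 1


def uncompressed_size(path, chunk_size=1 << 20):
    """Return the uncompressed size in bytes of a gzip file.

    The file is decompressed in chunks and never held in memory or
    written to disk, so this also works on a filesystem that could not
    store the uncompressed file.
    """
    total = 0
    with gzip.open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def exceeds_32bit_limit(size_in_bytes):
    """True if a file of this size cannot be addressed with 32-bit offsets."""
    return size_in_bytes > LIMIT_32BIT
```

For example, exceeds_32bit_limit(uncompressed_size("cumulative.dat.gz")) would tell you whether the uncompressed cumulative file is already too large for such a system.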
In *addition* to the cumulative.dat.gz files we make available on our FTP
server, we now also split that dataset into smaller parts, namely:
    cum_1.dat.gz      !all the cumulative file except the data in files below.
    cum_hum1.dat.gz   !human (non-GSS, non-HTG, non-EST)
We chose this data division:
- because it represents the current major division of data in the updates, or
our best guess about what will happen in the coming months.
- to try to keep the files below 2GB for as long as possible, eg before needing
- to avoid frequent changes. These filenames should last a while and
contain some data even soon after a new full release. SRS icarus files (for
example) should not need constant changing, and when they do we should be able
to see it coming well beforehand.
Those who 'mirror' our ftp.ebi.ac.uk/pub/databases/embl/new directory may wish
to have either the full cumulative.dat.gz or the cum_* set ignored, in order
to save some network bandwidth. The data contained is the same - we build the cum_*
files from the cumulative.dat.gz file afresh each day. There is no
correspondence between, say, cum_est1.dat and the est1.dat file of the full
FYI the EBI SRS server now has the EMBLNEW dataset using these cum_* files.
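
For mirror sites, the choice between the two sets comes down to a filename filter. The Python sketch below is one way this might look; it is an illustration only. The function name and the "keep" parameter are hypothetical, and the patterns assume the full file is named cumulative.dat.gz and the split files all match cum_*.dat.gz, as described above. Any other file in the directory is passed through untouched.

```python
import fnmatch


def files_to_mirror(filenames, keep="split"):
    """Filter a directory listing so only one copy of the data is fetched.

    keep="split" drops cumulative.dat.gz and keeps the cum_* files;
    keep="full" does the opposite. Files belonging to neither set are
    always kept.
    """
    if keep not in ("split", "full"):
        raise ValueError("keep must be 'split' or 'full'")

    selected = []
    for name in filenames:
        is_full = name == "cumulative.dat.gz"
        # "cumulative.dat.gz" does not match this pattern because its
        # fourth character is "u", not "_".
        is_split = fnmatch.fnmatchcase(name, "cum_*.dat.gz")
        if keep == "split" and is_full:
            continue
        if keep == "full" and is_split:
            continue
        selected.append(name)
    return selected
```

Since both sets hold the same data, either setting gives a complete mirror of the updates; which one to pick depends on whether your system can cope with the single large file.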