Charles Bailey wrote some time ago:
>We use GenBank as our primary nucleic acid sequence database, but also maintain
>a local copy of those entries in EMBL whose primary accession numbers do not
>appear in GenBank. This was about 10000 entries as of EMBL release 34, and is,
>I gather, dropping as GenBank, EMBL, and DDBJ sort out the backlog. Nonetheless,
>as long as there remain 'unique' EMBL entries, we would like to keep them
The 'backlog' is really a matter of negotiation between the database
providers. The most recent EMBL data library comes with a list of entries
called "BBMATCH". These are entries from the NCBI backbone database that are
expected to duplicate existing data and are therefore not included in the
EMBL data library. As a consequence, a search for an accession number in the
EMBL data library has to consult this file as well in order to locate the
correct data for the backbone entries. As established software is currently
not prepared to do that, it is necessary to keep these 'additions' if users
want to retrieve entries by
>available to local users. Until this summer, we were grateful users of Mike
>Cherry's ftp site in the US to obtain the 'unique' dataset, but that has been
>closed due to concerns about the network load generated by the transAtlantic
I must admit that I was the origin of the primary data; Mike Cherry mirrored
them via an ftp script in order to compute the exclusion set of these data.
As we (Biocomputing Basel) were requested to reduce network traffic as much
as possible, we cancelled this mirror. This measure was essential to avoid
having to shut down our services entirely. My apologies for any inconvenience
this has caused. In case of real need, EMBL has an FTP server
(ftp.embl-heidelberg.de) with the primary data.
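A minimal sketch of how such an exclusion set could be computed locally,
for readers who want to rebuild the 'unique' entries themselves. It is not
the script that was actually used for the mirror. It assumes the usual
flat-file layouts (an "AC" line per EMBL entry, an "ACCESSION" line per
GenBank entry, "//" as entry terminator, primary accession number listed
first); all function names are hypothetical.

```python
def genbank_primary_accessions(lines):
    """Collect the first accession number on each GenBank ACCESSION line."""
    accessions = set()
    for line in lines:
        if line.startswith("ACCESSION"):
            fields = line.split()
            if len(fields) > 1:
                accessions.add(fields[1])
    return accessions


def embl_entries(lines):
    """Yield (primary_accession, entry_lines) for each EMBL flat-file entry."""
    entry, primary = [], None
    for line in lines:
        entry.append(line)
        # Only the first AC line of an entry carries the primary accession.
        if primary is None and line.startswith("AC "):
            fields = line[2:].replace(";", " ").split()
            if fields:
                primary = fields[0]
        if line.startswith("//"):
            yield primary, entry
            entry, primary = [], None


def unique_embl_entries(embl_lines, genbank_lines):
    """Return EMBL entries whose primary accession is absent from GenBank."""
    known = genbank_primary_accessions(genbank_lines)
    return [(acc, entry) for acc, entry in embl_entries(embl_lines)
            if acc is not None and acc not in known]
```

Both arguments can be open file objects, so the flat files are read line by
line; only the set of GenBank accession numbers and the retained EMBL
entries are held in memory.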
>I am coming, then, hat in hand, to ask whether anyone has this dataset
>available? I would be happy to mirror it for anonymous ftp here, and don't see
>any need for incremental updates, since I expect very few of these 'unique'
>entries are corrected or altered without a corresponding addition to GenBank.
>Since I'm using this via the GCG package, I can take the dataset as flatfile or
>in GCG format.
I mailed Charles privately with respect to options. Let me raise a rather
general point here.
The EMBL data library, like other data libraries, is in the public domain.
The providers are paid, as part of their contract, to make the data
available. The way to access these data is via CD-ROM subscription. Updates
via electronic networks are extremely expensive if run as full-file FTP
copies over transatlantic links. This particularly affects sites which are
_not_ paid for data distribution as part of their business. The costs are
those of running the network as well as of keeping the data sets stable.
For us (Biocomputing Basel), there are therefore
- staff cost to maintain and monitor the preparation of updates
- disk and other hardware cost to prepare datasets and backups
- network cost to be paid to the network provider.
Making data available for 'free' therefore means that the costs listed
above need to be covered. The agencies which fund EMBnet Switzerland are
interested in funding access for local or national sites, and do not wish
to spend money on this kind of global service provision. Nota bene:
_collaborations_, i.e. bilateral exchanges, are fine, but in the end we had
several gigabytes of downloads per month to transatlantic destinations.
My point is that public domain data are free, but providing access to
these data costs money. Furthermore, collecting and synchronising updates
takes resources and therefore generates added-value data. Future development
will hopefully allow us to provide data to the community, within reasonable
limits, for free, but we can neither promise nor ensure that network costs
will always be covered by general infrastructure rather than by the
institutions causing the traffic. Not to mention staff and maintenance costs
for the hardware...
This discussion is valid not only for EMBL data that are 'unique' with
respect to GenBank, but certainly also the other way round.
| Dr. Reinhard Doelz | RFC doelz at urz.unibas.ch |
| Biocomputing | DECNET 20579::48130::doelz |
|Biozentrum der Universitaet | X25 022846211142036::doelz |
| Klingelbergstrasse 70 | FAX x41 61 261- 6760 or 267- 2078
| CH 4056 Basel | TEL x41 61 267- 2076 or 2247 |
+------------- bioftp.unibas.ch is the SWISS EMBnet node ----------------+
ftp mirror at nic.switch.ch