EMBL 'unique' entries available for ftp?

Reinhard Doelz doelz at comp.bioz.unibas.ch
Sun Oct 24 08:08:01 EST 1993

Charles Bailey wrote some time ago:

>We use GenBank as our primary nucleic acid sequence database, but also maintain
>a local copy of those entries in EMBL whose primary accession numbers do not
>appear in GenBank.  This was about 10000 entries as of EMBL release 34, and is,
>I gather, dropping as GenBank, EMBL, and DDBJ sort out the backlog. Nonetheless,
>as long as there remain 'unique' EMBL entries, we would like to keep them

The 'backlog' is rather a matter of negotiation between the database
providers. The most recent EMBL data library has a list of entries called
"BBMATCH". These entries from the NCBI backbone database are expected to be
duplicates of already existing data, and are not included in the EMBL data
library. This implies that a search for an accession number in the EMBL data
library needs to search this file as well in order to locate the correct data
for the backbone entries. As established software is currently not prepared
to do that, it is necessary to keep these 'additions' if users want to
retrieve entries by accession number.
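
The two-step lookup described above could be sketched roughly as follows. This
is only an illustration: the layout of the BBMATCH file is not described here,
so the sketch assumes a hypothetical whitespace-separated list with the
backbone accession number in the first column and the matching EMBL accession
number in the second. The names parse_match_list and resolve_accession are
likewise made up for this example.

```python
def parse_match_list(lines):
    """Build a backbone -> EMBL accession mapping.

    Assumed (hypothetical) layout: two whitespace-separated columns,
    backbone accession first, matching EMBL accession second.
    Lines starting with '#' and lines with fewer than two fields are skipped.
    """
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            mapping[fields[0].upper()] = fields[1].upper()
    return mapping


def resolve_accession(accession, embl_index, match_list):
    """Look up an accession directly, then fall back to the match list.

    embl_index is any dict-like index from accession number to entry.
    Returns None if the accession is found by neither route.
    """
    key = accession.strip().upper()
    if key in embl_index:
        return embl_index[key]
    matched = match_list.get(key)
    if matched is not None:
        return embl_index.get(matched)
    return None
```

A site whose retrieval software cannot be extended with such a fallback is
left with the option discussed here, namely keeping the additional entries
locally.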

>available to local users.  Until this summer, we were grateful users of Mike
>Cherry's ftp site in the US to obtain the 'unique' dataset, but that has been
>closed due to concerns about the network load generated by the transAtlantic

I must admit that I was the origin of the primary data, and Mike Cherry
mirrored these via an ftp script in order to compute the exclusion set of
these data. As we (Biocomputing Basel) were requested to reduce network
traffic as much as possible, we cancelled this mirror. This measure was
essential in order to avoid having to shut down our services entirely.
My apologies for any inconvenience caused by this. Note that EMBL has an FTP
server (ftp.embl-heidelberg.de) with primary data in case of real need.

>I am coming, then, hat in hand, to ask whether anyone has this dataset
>available? I would be happy to mirror it for anonymous ftp here, and don't see
>any need for incremental updates, since I expect very few of these 'unique'
>entries are corrected or altered without a corresponding addition to GenBank.
>Since I'm using this via the GCG package, I can take the dataset as flatfile or
>in GCG format.

I mailed Charles privately with respect to options. Let me raise a rather
general point here.

The EMBL data library, like other data libraries, is public domain. The
libraries are paid, as part of their contract, to make the data available.
The way of accessing these data is via CD-ROM subscription. Updates via
electronic networks are extremely expensive if run as full-file FTP copies
over transatlantic links. In particular, this affects sites which are _not_
paid for data distribution activities as part of their business. These costs
are the costs of running the network, as well as of keeping the data sets
stable. Therefore, for us (Biocomputing Basel) there is
	- staff cost to maintain and monitor the preparation of updates
	- disk and other hardware cost to prepare datasets and backups
	- network cost to be paid to the network provider. 

Making data available for 'free' therefore means that the costs listed
above need to be covered. Funding agencies which fund EMBnet Switzerland
are interested in funding access for local or national sources, and do not
desire to spend money on this kind of global service provision. Nota bene:
_collaborations_, i.e. bilateral exchanges, are fine, but in the end we had
several gigabytes per month of downloads to transatlantic destinations.

My point is that public domain data are free, but providing access to
these data costs money. Furthermore, the collection and synchronisation of
updates involves resources and therefore generates added-value data. Future
development will hopefully allow us to provide data to the community,
within reasonable limits, for free, but we can neither promise nor ensure
that network costs will always be covered by general infrastructure rather
than by the institutions causing the traffic. Not to mention staff and
maintenance costs for the hardware...

This discussion is valid not only for 'unique' data from EMBL in GenBank, but
certainly also applies the other way round.

Reinhard Doelz

EMBnet Switzerland
