EMBL 'unique' entries available for ftp?

Charles Bailey bailey at hmivax.humgen.upenn.edu
Fri Oct 29 15:51:52 EST 1993

First off, I apologize to anyone confused by my long delay in following up this
thread.  I've received a few replies to my original post, which come to
different conclusions about the size of the EMBL36-GenBank79 exclusion set, and
I'm attempting to resolve this before I post a summary back to the net.  More
on this once I've sorted things out . . .

In article <1993Oct24.130801.24171 at comp.bioz.unibas.ch>, doelz at comp.bioz.unibas.ch (Reinhard Doelz) writes:
> Charles Bailey wrote some time ago; 
>>We use GenBank as our primary nucleic acid sequence database, but also maintain
>>a local copy of those entries in EMBL whose primary accession numbers do not
>>appear in GenBank.  This was about 10000 entries as of EMBL release 34, and is,
>>I gather, dropping as GenBank, EMBL, and DDBJ sort out the backlog. Nontheless,
>>as long as there remain 'unique' EMBL entries, we would like to keep them
> The  'backlog'  is  rather  a negotiation between database providers. The most 
> [description of BBMATCH deleted]

This kind of thing will be useful in the future, but, as you mention, for now
one has to keep these entries around.  This particular issue (scanned vs
submitted entries) may take a while to solve, since, as I understand it, NCBI
is finding that there are differences between some of the bb records and the
'equivalent' submitted records.  I expect that these represent changes to the
data that the authors incorporated into their figures between submission of the
original sequence and publication, but didn't pass on to the database
maintainers.  This highlights a problem: despite a strenuous effort by
the database maintainers, the biological community has not yet been convinced
that sequences in the database must be viewed as 'live' data, and that
investigators should notify the databases of changes to the sequence as they
are discovered.
This will benefit everyone, as analysis of sequence data will be based on a
better substrate, and investigators who want to use retrieved data for wet
experiments won't have to rely on obsolete data or try to track down the
original depositor.

>>available to local users.  Until this summer, we were grateful users of Mike
>>Cherry's ftp site in the US to obtain the 'unique' dataset, but that has been
>>closed due to concerns about the network load generated by the transAtlantic
> I  must  admit  that  I  was  the origin of the primary data, and Mike Cherry 
> mirrored these via  ftp script in order to compute the exclusion set of these 
> the data. As we (Biocomputing Basel) were requested to reduce network traffic 
> as  much  traffic  as  possible,  we  cancelled this mirror. This measure was 
> essential in order to  avoid that we needed to shut down the entire services. 
> My apologies for any  inconvenience caused by this. Thus, EMBL has a FTP ser-
> ver (ftp.embl-heidelberg.de) with primary data in case of real need. 

Please note that I wasn't casting aspersions on your decision to stop the
mirror.  In fact, it's precisely because bandwidth, especially across the
Atlantic, is narrow and expensive that I'd like to get and mirror an exclusion
set: it minimizes the data which must be transferred, and it provides a site
in North America from which to originate transfers to North American sites.

> [text deleted]
> My point  is  that  public  domain  data  are  free, but providing access to 
> these data cost money. Furthermore,  the  collection  and synchronisation of
> updates involves resources and therefore generates added-value data. The fu-
> ture development will hopefully allow us to provide data to  the  community, 
> within reasonable limits, for free, but we can neither promise nor make sure
> that network costs will always be covered by  general  infrastructure rather 
> than by the institutions causing the traffic. Not to mention staff and main-
> tenance cost for the hardware...

This is a valid point.  I agree that it's unreasonable to support routine
access to full databases across transatlantic links.  Ultimately, it may be
necessary to require that this exchange of data occur offline via CD or tape.
However, given that there seems to be bandwidth available within national
networks for this kind of distribution, and given that it's arguably more
efficient than sending static media containing the data to every site, I'm
hoping that a distribution scheme can be worked out.  I'm willing to put the
exclusion set I generate up for ftp for now; if I'm swamped beyond our local
resources, I, too, may have to reconsider.  In general, I'd advocate the
following principles for data exchange:
 - we should try to provide sites with access to the available data in as
   efficient a manner as possible.  Ultimately, this will be achieved when
   NCBI, EMBL, and DDBJ sort out the differences between their databases,
   and each database covers the content of the others.  In the meantime,
   exclusion sets seem to me the best way to go, though figuring out
   precisely what should go into an exclusion set can be tricky.
 - we should try to provide this data in such a way as to minimize load on
   net bottlenecks.  The Basel-Boston mirror was an example of this, but it
   fell victim to limitations on available resources, and to overuse of the
   Basel site by US sites which should have been using the mirror.
 - sites retrieving data should be careful to minimize net consumption.  If you
   can get the data from a local site, do it in preference to a distant site.
   If you can use an exclusion set, use it instead of the whole database. Get
   daily incremental updates instead of retrieving the entire cumulative update
   every day.  Cooperate with other nearby sites to exchange data and minimize
   the number of sites which have to go to the central source.  Etc.
I realize that this won't solve all usage problems, and in some ways assumes
the presence of a backbone like NSFnet or EMBnet to make the intraregional
transfers reasonable, but it's at least a start.
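
For anyone who wants to experiment, the exclusion set discussed in this thread
(EMBL entries whose primary accession number does not appear in GenBank) can be
sketched roughly as follows.  This is only an illustration, not the procedure
used at any of the sites mentioned.  It assumes the standard flat-file layouts
(EMBL "AC" lines with semicolon-separated accessions, GenBank "ACCESSION"
lines, "//" as the entry terminator), it assumes that "appears in GenBank"
means "is listed on any ACCESSION line", and the function names and sample
accession numbers are made up for the example.

```python
from typing import Iterable, Iterator, List, Optional, Set


def split_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield flat-file entries one at a time; each entry ends with a '//' line."""
    entry: List[str] = []
    for line in lines:
        entry.append(line)
        if line.startswith("//"):
            yield entry
            entry = []
    if any(text.strip() for text in entry):
        yield entry  # tolerate a final entry with no terminator


def genbank_accessions(lines: Iterable[str]) -> Set[str]:
    """Collect every accession on GenBank ACCESSION lines (and continuations)."""
    found: Set[str] = set()
    in_block = False
    for line in lines:
        if line.startswith("ACCESSION"):
            in_block = True
            found.update(line.split()[1:])
        elif in_block and line.startswith(" ") and line.strip():
            found.update(line.split())  # indented continuation line
        else:
            in_block = False
    return found


def embl_primary_accession(entry: List[str]) -> Optional[str]:
    """Return the first accession on the first AC line, or None if absent."""
    for line in entry:
        if line.startswith("AC "):
            fields = line[2:].replace(";", " ").split()
            if fields:
                return fields[0]
    return None


def exclusion_set(embl_lines: Iterable[str],
                  genbank_lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield EMBL entries whose primary accession is unknown to GenBank.

    Assumption: entries with no AC line are skipped rather than kept.
    """
    known = genbank_accessions(genbank_lines)
    for entry in split_entries(embl_lines):
        accession = embl_primary_accession(entry)
        if accession is not None and accession not in known:
            yield entry


# Tiny made-up sample data; the accession numbers are hypothetical.
SAMPLE_EMBL = [
    "ID   AAAA       standard; DNA; UNC; 10 BP.",
    "AC   X00001;",
    "//",
    "ID   BBBB       standard; DNA; UNC; 12 BP.",
    "AC   X00002; X00003;",
    "//",
]
SAMPLE_GENBANK = [
    "LOCUS       AAAA         10 bp    DNA",
    "ACCESSION   X00001",
    "//",
]

if __name__ == "__main__":
    for unique_entry in exclusion_set(SAMPLE_EMBL, SAMPLE_GENBANK):
        print(embl_primary_accession(unique_entry))
```

In practice the two line iterables would be open file handles on the release
files, so only the set of GenBank accessions needs to be held in memory.  The
sketch does not address the harder cases raised above, such as scanned and
submitted records that describe the same sequence under different accessions.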

Thanks for everyone's time and patience on this thread.

                    Charles Bailey

!             Dept. of Genetics / Howard Hughes Medical Institute
! University of Pennsylvania School of Medicine  Rm. 430 Clinical Research Bldg.
!     422 Curie Blvd.  Philadelphia, PA 19104 USA      Tel. (215) 898-1699
!          Internet: bailey at genetics.upenn.edu  (IN

