Updating databases

David Mathog mathog at seqvax.caltech.edu
Fri Jan 6 14:59:00 EST 1995


In article <1995Jan6.073436.13810 at comp.bioz.unibas.ch>, doelz at comp.bioz.unibas.ch (Reinhard Doelz) writes...
>David Mathog (mathog at seqvax.caltech.edu) wrote:
>: It would be *really* nice if the folks who maintain and distribute
>: databases would make it a bit easier to automate updates.
>: etc.
> 
>Comment 1: 
>The proposed stubs try to compensate the weakness of a transaction based 
>on a poll mechanism without query. Rather than the customer inquires in 
>detailed fashion, the current schema of FTP requires the FTP site to be 
>(1) fixed in file names (btw, what do you do with your software if a new 
>division appears)

Actually, part of the purpose of stub.txt was to compensate for minor 
variances in file names/types/locations.  If a new division appears it 
would have to be handled manually (the first time), same as it is now.

>(2) preprepared for any request, be it day-by-day, 
>week-by-week, or other

Well, yes, but the only valid "request" is "send me the description file".  
So in order to service that request all the FTP site needs is a current
description file.  Writing such a file might take 20 minutes the first
time, but should only take seconds for an update.  (Change the version 
number and date.)  Part of the reason to use stub files is that they would
make it obvious to the FTP site maintainer when a file name has changed.
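A minimal sketch of what such a description ("stub") file and a client-side
parser might look like; the field names and layout here are invented for
illustration, not any provider's actual format:

```python
# Hypothetical "stub" description file a database provider might publish.
# Every field name and value below is made up for illustration.
STUB_TEXT = """\
database: genbank
release:  86.0
date:     1995-01-03
file: gbpri.seq.Z  division=primate  bytes=41872193
file: gbrod.seq.Z  division=rodent   bytes=28114055
"""

def parse_stub(text):
    """Split the description file into a header dict and a list of files."""
    header, files = {}, []
    for line in text.splitlines():
        if line.startswith("file:"):
            parts = line.split()
            # parts[1] is the file name; the rest are key=value attributes
            files.append({"name": parts[1],
                          **dict(p.split("=") for p in parts[2:])})
        elif ":" in line:
            key, _, val = line.partition(":")
            header[key.strip()] = val.strip()
    return header, files

header, files = parse_stub(STUB_TEXT)
print(header["release"], [f["name"] for f in files])
```

The point is that the client only ever asks for this one small file; a
renamed data file shows up as a changed `file:` line rather than a failed
transfer.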

>, and (3) does not at all tackle the question what 
>to do with it. The proposed schema does not mention whether the target is 
>to update (1) formatted (which format? ),  (2)  unformatted or (3) incremental 
>data. In particular, the latter is most appropriate at wide area networks 
>to save bandwidth, and raises management problems at the local site. 

Since WAN access is currently "free", we are not yet worrying about the 
economics of continuous incremental vs. "release" forms of update.  While
it is undoubtedly wasteful of bandwidth, we replace the entire database at 
each release rather than trying to upgrade it incrementally to the same 
state.
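The full-replace policy above can be sketched roughly as follows; the
RELEASE version-file convention and the directory layout are assumptions
made for illustration, and the actual transfer is left to the caller:

```python
import os
import shutil

def needs_update(local_version_file, remote_release):
    """True if the locally recorded release differs (or is missing)."""
    try:
        with open(local_version_file) as f:
            return f.read().strip() != remote_release
    except FileNotFoundError:
        return True

def full_replace(db_dir, remote_release, fetch_all):
    """Throw away the old copy and download the new release in full.

    fetch_all is caller-supplied and does the real transfer; returns
    True if a replacement happened, False if we were already current.
    """
    if not needs_update(os.path.join(db_dir, "RELEASE"), remote_release):
        return False
    shutil.rmtree(db_dir, ignore_errors=True)   # drop the stale copy whole
    os.makedirs(db_dir)
    fetch_all(db_dir)
    with open(os.path.join(db_dir, "RELEASE"), "w") as f:
        f.write(remote_release)
    return True
```

Crude, but it never leaves the local copy in a half-upgraded state, which
is exactly the management problem incremental updates introduce.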

You are right that the stub files would be much more work for most types
of incremental database updates.  Of course, incremental updates are a big 
pain in their own right.

> 
>Comment 2:
>The proposed schema seems to imply a resource discovery based on static 
>listings. These are notoriously difficult to maintain and do not necessarily
>offer a quality control issue (see below). Even if this were a possibility,
>this implied that all sites referencing each other have the same policy of 
>'free for all' and do honor the same quality standards. 

Yeah, it would help if everybody could agree on a single format for the 
description files.  History does not indicate that this is likely.

> 
>Comment 3:
>The major problem in sequence database updating is that the quality of 
>an update cannot be judged if you download it as file. Neither date nor
>contents are sufficiently characterized in their format. A synchronisation
>is required (or at least desirable) which allows to crosscheck the contents
>of your local, adapted , formatted copy to the originally present data at 
>the provider. Versions and dates are nice but insufficient to characterize
>a contents in incremental updates. 
> 

Quite right - there is no simple way to judge database quality.  My
(parasitic) strategy has been to let any database release "age" for a
couple of weeks before downloading it.  This has usually resulted in some
other brave soul finding the problems in a particular release and the
database provider subsequently fixing them.
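That aging strategy amounts to a one-line date check; the two-week
threshold and date handling below are just a sketch of the idea, not
anything I actually run:

```python
from datetime import date, timedelta

AGE_DAYS = 14  # let some other brave soul find the problems first

def old_enough(release_date, today=None):
    """Only mirror a release once its announced date is AGE_DAYS old."""
    today = today or date.today()
    return today - release_date >= timedelta(days=AGE_DAYS)

print(old_enough(date(1995, 1, 3), today=date(1995, 1, 6)))  # prints False
```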

>Comment 4:
>The proposed schema requires a considerable amount of coordination in 
>between providers. The resources for the update buisness are fairly low
>as you, and many others, are neither prepared nor willing to pay for the 
>service you request. 

It was meant to be very cheap to implement on the server side - we're 
only talking about 20-100 lines of text per major database per site, and 
most of that will not change between releases.

We dealt with coordination above.

In closing - I'd happily live without the proposed files, so long as the
database providers make more of an effort to keep filenames and paths the
same between releases. 

Regards,

David Mathog
mathog at seqvax.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech 




More information about the Bio-soft mailing list