Updating databases

David Mathog mathog at seqvax.caltech.edu
Thu Jan 5 15:10:00 EST 1995


It would be *really* nice if the folks who maintain and distribute
databases would make it a bit easier to automate updates.  It isn't all 
that hard now, but it seems that every time I go to do a set of updates
some file has been moved, or renamed, or is .txt.Z when it was just
.txt the previous time, or has in one way or another mutated so as to
break my retrieval software. 

This goal could be accomplished by doing something like the following:

1.  At the final FTP/e-mail distribution sites for a particular database,
    place a file that is always called "stub.txt" and that contains a
    retrieval description of that database, with all databases described
    in the same format. 

2.  At several well known sites provide a file "dbs.txt" that contains
    pointers to the major distribution sites for each database.

Example of a stub.txt file:

DATABASE:databasename          (genbank/pir/swissprot etc.)
VERSION:40                     (text string)
RELEASE_DAY:5                  (numeric, day of month)
RELEASE_MONTH:12               (numeric, month of year)
RELEASE_YEAR:1994              (numeric, full year)
NUMBER_OF_FILES:10             (Number of files in full distribution)
1:RETRIEVAL_NAME:example.dat.Z (Valid name on server of first file)
1:RETRIEVAL_METHOD:ftp         (ftp or mail)
(for FTP)
1:RETRIEVAL_SITE:ftp.somewhere.edu  (if blank, same as for stub.txt)
1:RETRIEVAL_PATH:              (Relative to stub file, for FTP, here blank)
1:RETRIEVAL_TYPE:binary        (text,binary)
1:RETRIEVAL_SIZE:              (bytes)
(for mail)
1:RETRIEVAL_PIECES:15           (Number of pieces that will come back)
1:RETRIEVAL_TEMPLATE:Example.dat part ## of 15  (What the messages will say) 
1:RETRIEVAL_ADDRESS:listserv at somewhere   (mail address)
1:RETRIEVAL_SUBJECT:Send example.dat     (mail subject line)
1:RETRIEVAL_BODY:              (Here, a single blank line)
(for either)
1:PROCESS:LZ decompress        (UUD, unzip, debinhex, untar, etc.)
                               (Optional: multiple process steps)
1:FINAL_NAME:example.dat       (Suggested standard name when installed)
1:FINAL_TYPE:text              (text,binary)
1:FINAL_ORGANIZATION:genbank   (genbank or embl flatfile, ASN.1,MSWord...) 
2:RETRIEVAL_NAME:another.hqx   (Valid name on server of second file)
etc.
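
A minimal sketch of how a site might read such a file, in Python.  It
assumes that a real stub.txt carries only the KEY:value lines, without the
parenthesized annotations shown in the example above, and that multiple
process steps would appear as repeated PROCESS lines (the example does not
say how they are written).  The function name parse_stub is hypothetical.

```python
def parse_stub(text):
    """Split stub.txt text into database-level fields and per-file fields.

    Returns (header, files): header maps KEY -> value, files maps the
    file number to a dict of that file's KEY -> value entries.
    """
    header = {}
    files = {}
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # blank lines or anything that is not KEY:value
        first, rest = line.split(":", 1)
        if first.isdigit():
            # Per-file line of the form N:KEY:value
            key, _, value = rest.partition(":")
            entry = files.setdefault(int(first), {})
            if key.strip() == "PROCESS":
                # Assumption: repeated PROCESS lines are ordered steps.
                entry.setdefault("PROCESS", []).append(value.strip())
            else:
                entry[key.strip()] = value.strip()
        else:
            # Database-level line of the form KEY:value
            header[first.strip()] = rest.strip()
    return header, files
```

A blank value, such as the empty RETRIEVAL_PATH above, comes back as an
empty string, so the caller can apply the "same as for stub.txt" default.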

Example of a dbs.txt file:

DAY:5                          (as above - date of this dbs.txt file)
MONTH:12
YEAR:1994
DATABASE:databasename          (genbank/pir/swissprot etc.)
VERSION:40                     (text string)
DISTRIBUTION_SITES:5           (number of "official" distribution sites)
1:RETRIEVAL_METHOD:ftp         (ftp or mail)
(etc. Same syntax as for stub.txt, above)
2:RETRIEVAL_METHOD:mail        (ftp or mail)
etc.
DATABASE:nextdatabasename
etc.
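
Reading dbs.txt could look something like the following Python sketch.
It assumes the file holds only KEY:value lines (no parenthesized
annotations), that the date lines come before the first DATABASE line,
and that each DATABASE line starts a new block.  The name parse_dbs and
the "fields"/"sites" layout of the result are made up for illustration.

```python
def parse_dbs(text):
    """Split dbs.txt text into the file date and one record per database.

    Returns (date, databases): date maps DAY/MONTH/YEAR -> value, and
    databases maps each database name to
    {"fields": {KEY: value}, "sites": {N: {KEY: value}}}.
    """
    date = {}
    databases = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # skip blank or non KEY:value lines
        first, rest = line.split(":", 1)
        first = first.strip()
        if first == "DATABASE":
            # Every DATABASE line opens a new block.
            current = {"fields": {}, "sites": {}}
            databases[rest.strip()] = current
        elif current is None:
            # Lines before the first DATABASE describe the file itself.
            date[first] = rest.strip()
        elif first.isdigit():
            # Numbered line N:KEY:value describes distribution site N.
            key, _, value = rest.partition(":")
            site = current["sites"].setdefault(int(first), {})
            site[key.strip()] = value.strip()
        else:
            current["fields"][first] = rest.strip()
    return date, databases
```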

The exact syntax doesn't matter, just keep it simple so that it's trivial 
to parse using DCL/shell scripts/basic/whatever.  Agree on case conventions
so that a simple file difference can be used to detect changes: it should
not trigger on "MONTH" vs. "month" (not a change), but should trigger on
"/pub/VMS" vs. "/pub/vms" (is a change). 
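
If the case of the keywords could not be relied on, one alternative to a
plain difference is to normalize the keywords before comparing, leaving
the values alone.  A rough Python sketch, assuming KEY:value and
N:KEY:value lines as in the examples above (canonical_lines and
has_changed are hypothetical names):

```python
def canonical_lines(text):
    """Return the KEY:value lines with keywords upper-cased, values untouched."""
    lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue
        head, rest = line.split(":", 1)
        head = head.strip()
        if head.isdigit() and ":" in rest:
            # N:KEY:value -> upper-case only KEY
            key, value = rest.split(":", 1)
            lines.append(f"{head}:{key.strip().upper()}:{value.strip()}")
        else:
            # KEY:value -> upper-case only KEY
            lines.append(f"{head.upper()}:{rest.strip()}")
    return lines


def has_changed(old_text, new_text):
    """True if the two files differ in anything other than keyword case."""
    return canonical_lines(old_text) != canonical_lines(new_text)
```

With this, "MONTH:12" and "month:12" compare as equal, while a path of
"/pub/VMS" against "/pub/vms" still counts as a change.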

Sites such as ours would use this information to:

1.  Periodically check the status of the databases that we maintain by 
grabbing the latest dbs.txt and differencing it with the last one.  If
there are no changes, quit.  If changes are found, check the version
numbers of the databases that we maintain.

2.  Retrieve the stub.txt file for each updated database from one of the
official distribution sites. Use the information in it to retrieve the
database.
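
The decision part of these two steps might be sketched in Python as
below.  The sketch leaves out the actual FTP/mail retrieval and only
works on the text of the old and new dbs.txt; it assumes KEY:value lines
with a VERSION line following each DATABASE line, as in the example
above.  The names versions and databases_to_update are hypothetical.

```python
def versions(dbs_text):
    """Map each database name in dbs.txt text to its VERSION string."""
    result = {}
    name = None
    for raw in dbs_text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # not a KEY:value line
        if key == "DATABASE":
            name = value.strip()
        elif key == "VERSION" and name is not None:
            result[name] = value.strip()
    return result


def databases_to_update(old_dbs, new_dbs, maintained):
    """List the maintained databases whose version differs in the new dbs.txt."""
    if old_dbs == new_dbs:
        return []  # step 1: no difference at all, so quit
    old_v = versions(old_dbs)
    new_v = versions(new_dbs)
    # Step 1, continued: compare versions only for databases kept locally.
    return [db for db in maintained
            if db in new_v and old_v.get(db) != new_v[db]]
```

For each name returned, the site would then fetch that database's
stub.txt from one of the listed distribution sites and follow it (step 2).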

With this system, database distributors could rearrange their sites, and
to some extent, change the names of the files, without breaking everybody's
retrieval software, so long as this retrieval software made use of these
two types of files.  Hmm, with a little care in the syntax this could also
be used to handle software distribution, although in that case I'd guess that 
most of us would want to test the software before having it install
automatically over the previous, working copy!

Comments?

David Mathog
mathog at seqvax.bio.caltech.edu
Manager, sequence analysis facility, biology division, Caltech 



