Announcement: Sagittarius PIR-44 available via anonymous FTP

Fri Jun 16 09:02:10 EST 1995

		SAGITTARIUS PIR-44 (April 1995) variant

    SAGITTARIUS PIR is a highly compact databank variant of the original PIR
database, designed to let individual researchers and software developers use
sequence database information without large storage requirements. It contains
custom-compressed PIR information and an interface written in C that gives
fast direct access to the stored information without decompressing the
corresponding files in full. Starting with the PIR-41 version, the same
databank files and the same interface C-file can be used without any
modification on both PC-compatibles and UNIX V computers (a Mac version of
the interface is forthcoming).
    The interface supports all standard PIR Request Network queries (e.g. get
the databank SEQ number for an entry; for a given databank SEQ number, get
specified information such as name, organism(s), keyword(s), sequence,
sequence features with coordinates etc.). Unlike the PIR Request Network,
SAGITTARIUS PIR lets you retrieve PIR information directly from your own C
program, even on a personal computer that is not connected to any network. In
addition, all numerical information, such as intron placement and feature
coordinates, is returned to the calling program as numeric values rather than
text strings, which simplifies (sub)sequence manipulation.
    For still greater compactness and flexibility, SAGITTARIUS PIR is
organized as separate file sets, each containing database information of one
independent type (e.g. sequences, entry indexes, organisms etc.). On any
particular computer the user can easily change the available configuration of
PIR information as needed, without affecting retrieval of the other types of
stored information. For example, the file set that contains the protein
sequences themselves and their PIR entry indexes (including reverse index
arrays) takes less than 14Mb of disk space.
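
    To illustrate why numeric coordinates are convenient for a calling
program, the sketch below cuts a subsequence out of a sequence string using
the start and end positions of a feature. The names feature_t and
extract_feature are hypothetical, invented for this example; they are not
part of the SAGITTARIUS interface. The sketch also assumes 1-based, inclusive
coordinates.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical feature record: 1-based, inclusive residue coordinates. */
typedef struct {
    long start;
    long end;
} feature_t;

/* Return a newly allocated copy of the residues covered by the feature,
   or NULL if the coordinates fall outside the sequence or memory runs out.
   The caller frees the result. */
char *extract_feature(const char *seq, feature_t f)
{
    size_t len, n;
    char *out;

    assert(seq != NULL);
    len = strlen(seq);

    if (f.start < 1 || f.end < f.start || (size_t)f.end > len)
        return NULL;

    n = (size_t)(f.end - f.start + 1);
    out = malloc(n + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, seq + (f.start - 1), n);
    out[n] = '\0';
    return out;
}
```

    With the 10-residue string MKVLAAGIVG and coordinates 3..6, this returns
VLAA; no text parsing of the coordinates is needed.
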
    For PC-compatibles (with a version for UNIX/X Window forthcoming), a dialog
shell is available that supports all standard PIR Request Network queries
plus homology searches, alignments etc.
    SAGITTARIUS PIR is distributed freely (all databank file sets, interface
C-file, test/example program C-file, PC-shell executable) via anonymous FTP.
The file sets and the interface can be used or included in any commercially
distributed package without restriction. Consultations and advanced interface
variants (currently used to support fast, efficient database manipulation
in other SAGITTARIUS family packages) are available from the developers upon
request.

  At present, the SAGITTARIUS PIR databank stores the following original
information types (fields) of the PIR database in custom compressed form:

	  - database entry index 
	  - accession number(s) 
	  - other (non-PIR) database crossreference(s)
	  - protein name 
	  - organism name(s) 
	  - alternative protein name(s)
	  - keyword(s)
	  - superfamily name(s)
	  - gene name(s)
	  - map position(s)
	  - unusual start codon(s)
	  - intron(s) placement
	  - literature reference(s), including for each:
	        - journal or citation
	        - free-format comment
	  - sequence feature(s)
	  - free-format comment
	  - protein sequence itself

  For PIR-44, all bank files together take 37+ Mb on hard disk (23+ Mb in the
ZIP-compressed form used for transfer). Each original database information
field (e.g. sequences, organisms, names, keywords etc.) is stored in a
separate file set, which lets the user configure reduced bank variants simply
by not unpacking the unneeded files. For example, leaving out the literature
references reduces the bank to only 27+ Mb. The core variant of the databank
files (the minimal configuration supported by the available PC-shell) includes
only indexes and sequences. Any more complete configuration can be produced
simply by unpacking the corresponding file sets from the distribution.
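
  The size arithmetic for a chosen configuration can be sketched as follows.
The per-set byte counts were obtained by adding up the decompressed file
sizes given in the control listing at the end of this announcement; the
names fileset_t, FILESETS and config_bytes are hypothetical and exist only
in this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decompressed size of each file set in bytes (sum of the per-file sizes
   in the control listing at the end of the announcement). */
typedef struct {
    const char *name;
    long bytes;
} fileset_t;

static const fileset_t FILESETS[] = {
    {"CORE",     13664381L}, {"NAME",     2216988L},
    {"ORGANISM",  1089132L}, {"KEYWORD",   808488L},
    {"S_FAMILY",   532048L}, {"CROSSREF",  746988L},
    {"FEATURE",   3108084L}, {"GENE_MAP",  312216L},
    {"ALT_NAME",   661468L}, {"GENE",      544560L},
    {"CODON",      278688L}, {"ACC_CODE", 2586848L},
    {"COMMENT",    834268L}, {"INTRON",    339140L},
    {"REF_JOU",   2002684L}, {"REF_AUTH", 3494996L},
    {"REF_TITL",  4175332L}, {"REF_COMM",  434792L}
};

#define N_FILESETS (sizeof FILESETS / sizeof FILESETS[0])

/* Sum the sizes of all file sets whose names do NOT start with
   skip_prefix. Pass NULL to keep every file set. */
long config_bytes(const char *skip_prefix)
{
    long total = 0;
    size_t i;
    size_t plen = skip_prefix ? strlen(skip_prefix) : 0;

    for (i = 0; i < N_FILESETS; i++) {
        if (skip_prefix &&
            strncmp(FILESETS[i].name, skip_prefix, plen) == 0)
            continue;
        assert(FILESETS[i].bytes > 0);
        total += FILESETS[i].bytes;
    }
    return total;
}
```

  With these figures, config_bytes(NULL) gives 37,831,101 bytes and
config_bytes("REF_") gives 27,723,297 bytes, which agrees with the 37+ Mb
and 27+ Mb quoted above if Mb is read as 10^6 bytes.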

     List of distribution files and the decompressed content of each

ZipFile    ZipSize                Content Description
------------------------------------------------------\/  Core config part  \/
CORE    12,542,278                Entry indexes + the sequences themselves
------------------------------------------------------\/ User-variable part \/
NAME     1,137,312                Sequence names
ORGANISM   533,940                Organisms
KEYWORD    322,655                Keywords
S_FAMILY   168,410                Superfamily classifications
CROSSREF   327,537                Other database crossreferences
FEATURE  1,091,777                Sequence features
GENE_MAP    29,621                Genetic map positions
ALT_NAME   220,725                Sequence alternative names
GENE       194,431                Genes
CODON       10,430                Unusual start codons
ACC_CODE 1,569,822                PIR accession codes
COMMENT    285,206                Sequence comments
INTRON      47,096                Intron(s) placement
REF_JOU    991,882                References core : the references themselves
REF_AUTH 1,779,989                 Ref. extension : reference authors
REF_TITL 2,108,161                 Ref. extension : reference titles
REF_COMM    63,120                 Ref. extension : reference comments
------------------------------------------------------\/ Dialog shell for PC \/
PC_SHELL   194,641                PC-executable + two MAP-files (to \PIR)
------------------------------------------------------\/       Interface     \/
INTERFAC    15,350                Interface and test program, C-files, PRJ file

  SAGITTARIUS PIR Automated Sequence Bank is a dialog shell for working with
the compressed sequence database information. It is aimed at MS DOS/Windows
PC-compatibles with a hard disk cache installed (Smartdrive, Hyperdisk,
Ncache etc.). A P5 or 486 is recommended; a 386 or 286 will be significantly
slower but still usable.

The dialog data shell supports the following main operations:
   - selection of sequences to bank buffer by
        - dictionary-defined record for specified informational
          field (name, source, keyword, feature etc.)
        - user-defined context in specified informational
          field (name, source, keyword, feature etc.)
        - set of dictionary-defined records for different informational
          fields (source, keyword, superfamily etc.)
        - perfect or imperfect SEQ homology with a user-defined short sequence
   - storing and retrieving buffer content (SEQ bank numbers and indexes)
     between sessions
   - output of user-specified (buffer) SEQ data to disk files
   - fast SEQ homology searches (for a user-defined SEQ no longer than
     50-100 residues, only 1 hour against the full PIR bank on a 486/33)
   - fast subregion-sensitive pairwise alignments (user-defined
     sequence against buffer SEQs or the full bank)
   - easy data access from user programs (C) as a support for
     applications development 
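
  Selection by perfect or imperfect homology with a short sequence can be
pictured as a scan that tolerates a limited number of mismatches. The shell's
actual search algorithm is not described in this announcement; the following
is only a naive sketch of the idea, and the function name find_near_match is
hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Return the 0-based offset of the first window of seq that differs from
   probe in at most max_mismatch positions, or -1 if there is none.
   max_mismatch == 0 gives a perfect-match search. */
long find_near_match(const char *seq, const char *probe, int max_mismatch)
{
    size_t n = strlen(seq);
    size_t m = strlen(probe);
    size_t i, j;

    assert(max_mismatch >= 0);
    if (m == 0 || m > n)
        return -1;

    for (i = 0; i + m <= n; i++) {
        int miss = 0;
        /* Stop comparing this window once the mismatch budget is spent. */
        for (j = 0; j < m && miss <= max_mismatch; j++)
            if (seq[i + j] != probe[j])
                miss++;
        if (miss <= max_mismatch)
            return (long)i;
    }
    return -1;
}
```

  This scan costs time proportional to sequence length times probe length
per sequence, so it only illustrates the matching criterion, not a fast
search.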

    SAGITTARIUS databank files are normally filled with the currently
available PIR database information by the distributors only. The distribution
includes ready-to-use information files, the interface and executables, all
in compressed form.


  SAGITTARIUS PIR is available by anonymous FTP from:

     FTP.SCRI.FSU.EDU, directory /pub/genetics/pir/

  SAGITTARIUS PIR is also available by anonymous FTP from some 
of the well-known bio-servers (IUBIO etc.).


			Installation on UNIX system

  All decompressed SAGITTARIUS databank files must be placed in a single
directory, whose name must be specified correctly in the interface C-file
(in the first strings of the file). Putting the symbol 'X' in the first
position of that text string (instead of '/') makes the interface carry out
a formal check that the databank files are present.

			Installation on PC

  If you plan to use the PC-shell, all decompressed SAGITTARIUS databank
files must be placed in the directory \PIR on any single logical drive.
Otherwise, all decompressed SAGITTARIUS databank files must be placed in a
single directory, whose name must be specified correctly in the interface
C-file (in the first strings of the file). Putting the symbol 'X' in the
first position of that text string (instead of a drive letter) makes the
interface 1) search for the directory on all available logical drives and
2) carry out a formal check that the databank files are present.
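
  One way to picture the 'X' convention on a PC is sketched below. This is
not the interface's own code: the function names, the choice to probe drive
letters C to Z, and the use of SEQ.BAN as the file that is looked for are all
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Example predicate (an assumption): a directory counts as found if
   SEQ.BAN can be opened inside it. dir must end with a path separator. */
int seq_file_readable(const char *dir)
{
    char path[260];
    FILE *fp;

    if (strlen(dir) + strlen("SEQ.BAN") + 1 > sizeof path)
        return 0;
    strcpy(path, dir);
    strcat(path, "SEQ.BAN");
    fp = fopen(path, "rb");
    if (fp == NULL)
        return 0;
    fclose(fp);
    return 1;
}

/* Resolve a configured databank directory string into out.
   If it starts with 'X', try each drive letter C..Z in its place and keep
   the first one accepted by exists(); otherwise copy it unchanged, without
   any check. Returns 1 on success, 0 if nothing was found or the output
   buffer is too small. */
int resolve_bank_dir(const char *configured,
                     int (*exists)(const char *path),
                     char *out, size_t outsz)
{
    size_t len = strlen(configured);
    char drive;

    assert(exists != NULL);
    if (len == 0 || len + 1 > outsz)
        return 0;

    memcpy(out, configured, len + 1);
    if (configured[0] != 'X')
        return 1;               /* explicit drive letter: use as given */

    for (drive = 'C'; drive <= 'Z'; drive++) {
        out[0] = drive;
        if (exists(out))
            return 1;
    }
    return 0;
}
```

  Passing the presence test as a function pointer keeps the drive search
separate from the decision about which files must exist.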

  The bank executable BANK.EXE may be placed in any directory on any logical
drive. Important: the .MAP files contained in the same PC-shell archive must
be moved into the \PIR directory alongside the other databank files.

  It is highly recommended to run BANK.EXE from a directory (and/or logical
drive) other than the data location, to avoid accidental damage to the
structure of the bank files. The bank is designed for access to data on a
file server and can find the \PIR directory (and test it for a correct data
configuration) on any logical drive.



This package (with compressed data files) can be redistributed
freely without any limitations but only free of charge and for 
non-commercial usage. No changes in data files and/or executables 
are allowed.

You may include compressed SAGITTARIUS datafiles in your application 
packages freely even in the case of any commercial usage.


For helpful comments and discussions please contact

	Dr. Victor B. Strelets (strelets at

	Computational Genetics and Biophysics,
	Supercomputer Computations Research Institute, 
	FSU B-186, Tallahassee, FL 32306-4052, USA


To check your copy, you may use the following sizes (in bytes) of the
SAGITTARIUS PIR distribution files and of the files each archive contains:

CORE     ZIP    12,542,278
	SEQ0     BAN       330,572
	SEQ      BAN    10,911,789
	DIC      BAN       330,888
	DIC2     BAN       106,496
	IND      BAN       661,144
	IND0     BAN       330,572
	IND2     BAN       331,776
	0IND     REV       330,572
	1IND     REV       330,572

NAME     ZIP     1,137,312
	NAM      BAN     1,135,808
	NAM0     BAN       330,572
	NAM2     BAN       210,944
	0NAM     REV       209,100
	1NAM     REV       330,564

ORGANISM ZIP       533,940
	SOU      BAN       116,480
	SOU0     BAN       330,572
	SOU1     BAN        84,824
	SOU2     BAN        26,624
	0SOU     REV        26,020
	1SOU     REV       504,612

KEYWORD  ZIP       322,655
	KW       BAN        20,816
	KW0      BAN       330,388
	KW1      BAN       121,476
	KW2      BAN         6,144
	0KW      REV         5,672
	1KW      REV       323,992

GENE     ZIP       194,431
	GENE     BAN        89,792
	GENE0    BAN       279,284
	GENE1    BAN        44,160
	GENE2    BAN        36,864
	0GENE    REV        35,644
	1GENE    REV        58,816

ALT_NAME ZIP       220,725
	ANAM     BAN       151,432
	ANAM0    BAN       330,368
	ANAM1    BAN        45,028
	ANAM2    BAN        38,912
	0ANAM    REV        36,956
	1ANAM    REV        58,772

S_FAMILY ZIP       168,410
	SFAM     BAN        61,360
	SFAM0    BAN       278,972
	SFAM1    BAN        29,064
	SFAM2    BAN        14,336
	0SFAM    REV        14,020
	1SFAM    REV       134,296

ACC_CODE ZIP     1,569,822
	AC       BAN       751,104
	AC0      BAN       330,572
	AC1      BAN       377,236
	AC2      BAN       376,832
	0AC      REV       375,552
	1AC      REV       375,552

CODON    ZIP        10,430
	CDN      BAN           216
	CDN0     BAN       271,464
	CDN1     BAN           184
	CDN2     BAN         2,048
	0CDN     REV           108
	1CDN     REV         4,668

FEATURE  ZIP     1,091,777
	FT       BAN       323,616
	FT0      BAN       309,148
	FT1      BAN       328,440
	FT2      BAN        53,248
	FTN      BAN     1,132,072
	FTN0     BAN       309,148
	FTN1     BAN       328,440
	0FT      REV        53,204
	1FT      REV       270,768

GENE_MAP ZIP        29,621
	MAP      BAN        35,888
	MAP0     BAN       276,328

COMMENT  ZIP       285,206
	CC       BAN       524,056
	CC0      BAN       310,212

CROSSREF ZIP       327,537
	CR       BAN       416,416
	CR0      BAN       330,572

INTRON   ZIP        47,096
	INTR0    BAN       278,936
	INTR1    BAN        60,204

REF_JOU  ZIP       991,882
	REF      BAN     1,215,520
	REF0     BAN       330,572
	REF1     BAN       456,592

REF_AUTH ZIP     1,779,989
	AUT      BAN     2,707,832
	AUT0     BAN       330,572
	AUT1     BAN       456,592

REF_TITL ZIP     2,108,161
	TITLE    BAN     3,419,552
	TITLE0   BAN       330,564
	TITLE1   BAN       425,216

REF_COMM ZIP        63,120
	REFCOM   BAN        66,488
	REFCOM0  BAN       330,520
	REFCOM1  BAN        37,784

PC_SHELL ZIP       194,641
	BANK     EXE       535,134
	BANK1    MAP       165,600
	BANK2    MAP       165,600

INTERFAC ZIP        15,350
	INTERF   C          26,237
	BANK     H          12,163
	BANK0    H              50
	INTERF   H           6,048
	BANKTEST C           9,795
	BANKTEST PRJ         5,353
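
A check of an unpacked file against this listing could look like the sketch
below. The function names file_size and size_matches are hypothetical; the
sketch relies only on standard C file functions and assumes that seeking to
the end of a binary stream is supported, as it is on common systems.

```c
#include <assert.h>
#include <stdio.h>

/* Return the size of path in bytes, or -1 if it cannot be opened
   or measured. */
long file_size(const char *path)
{
    FILE *fp;
    long size = -1;

    assert(path != NULL);
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    if (fseek(fp, 0L, SEEK_END) == 0)
        size = ftell(fp);
    fclose(fp);
    return size;
}

/* Compare a file against the size given in the listing.
   Returns 1 on a match, 0 on a mismatch or if the file is missing. */
int size_matches(const char *path, long expected)
{
    return file_size(path) == expected;
}
```

For example, size_matches("SEQ.BAN", 10911789L) should return 1 when run in
the directory holding an intact copy of the CORE file set.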

Standard disclaimer:
The author(s) will in no way be held liable for any loss of profit or
any other commercial damage, including but not limited to special,
incidental, consequential or other damages, arising from use of this
package. You may use it only with the understanding that you do so
at your own risk and that your use of the software and datafiles
constitutes your agreement to this disclaimer.
