dFLASH server for latest GenBank/PIR/SwissProt (Release 1.1.0)

The dFLASH Project dflash at watson.ibm.com
Thu Nov 17 12:26:20 EST 1994


The dFLASH Group wishes to announce release 1.1.0 of the dFLASH electronic
mail server.  Beginning with this release of the server, we will be supporting
the latest release of the GENBANK, PIR and SWISSPROT databases.

In particular, users  can now carry out searches in 
        GENBANK    Release 85 (September 30, 1994)                             
        PIR        Release 42 (September 30, 1994) --> DEFAULT Database <--    
        SWISSPROT  Release 30 (October   30, 1994)                             

Full bibliographic references can optionally be included with the computed
alignments, for all three databases.

Notice that a number of necessary changes and additions have been incorporated
in the "query language".  For example, since we now support a larger set of
databases, "target protein" is not a valid directive anymore! The appended help
file describes the changes and available functions in detail.

NEW FEATURES:
   o    the reported results can now be sorted using a sorting key specified by
   	the user via the "query language"

   o    a smart-email filter has been implemented:  various specification
        errors  are now caught and corrected automatically; notifications
        are sent to the user for all taken actions.

It is our intention to update the server with the latest release of each of the
above databases within the first two weeks after it becomes available.

The server is accessible through the Internet and is now operating 24 hours a
day, 7 days a week and can be accessed both directly and through "Grail" of the
Oak Ridge National Lab.

Sincerely,

The dFLASH Group





------------------------------>  CUT HERE <-----------------------------------

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! The dFLASH server now supports the GenBank, PIR and SWISSPROT databases.  !!
!!      The supported releases are:                                          !!
!!      GENBANK    Release 85 (September 30, 1994)                           !!
!!      PIR        Release 42 (September 30, 1994) --> DEFAULT Database <--  !!
!!      SWISSPROT  Release 30 (October   30, 1994)                           !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!                         N O T A     B E N E                               !!
!! The dFLASH server is still under development.  If some of the answers do  !!
!! not make sense it is very likely that this is due to a bug in our code.   !!
!! Please, email bug reports and comments to dflash@watson.ibm.com with      !!
!! subject line "bug" or "comments".                                         !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    Dear User, welcome to Release 1.1.0 of the dFLASH server!

    The dFLASH server is a "homologous sequence retrieval" program for PROTEIN
and DNA sequences.  

    dFLASH is a parallel system running on an IBM SP/x architecture. Intra-node
communication, evidence integration and alignment are performed in parallel. 
The system has been implemented using IBM's Concert/C language for distributed
programming. The server is available 24 hours a day, 7 days a week and can be
accessed both directly and through "Grail" of the Oak Ridge National Lab.

    Incremental changes and improvements made to the server will be reflected
in the "Message of the day" at the beginning of this help file:  we recommend
that users periodically issue a `send help' request for up to date information
on the server.

    For the moment, we can process requests originating from email addresses of 
the form 
                 user@[machine.][subdomain.]institution.type
                        or 
                 user%machine@[machine.][subdomain.]institution.type
                        or 
                 "string::user"@[machine.][subdomain.]institution.type

We plan to further expand the accepted formats, depending on demand.


$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$

HOW TO USE THE SERVER: You can use the dFLASH facilities by sending an email 
---------------------- message with the appropriate syntax to the address 
		       "dflash@watson.ibm.com" (without the quotes).
			
SUBJECT LINE: It is important that the "Subject" line of your message contain 
------------- one of: { dflash, dFlash, dFLASH, DFLASH }.  Messages whose
              subject line does NOT conform to this rule, **WILL BE LEFT
	      UNPROCESSED**.  The reason for that restriction is that we want
	      to be able to automatically distinguish between messages that are
	      addressed to the server and those that are meant for one  of the
	      group members.
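
As an illustration only, the subject-line rule above could be checked as in
the following Python sketch.  The function name is hypothetical, and plain
substring matching is an assumption: the help text does not say how the server
actually tests the subject line.

```python
# Spellings listed in the help text; any other capitalisation is not accepted.
ACCEPTED_SUBJECT_WORDS = ("dflash", "dFlash", "dFLASH", "DFLASH")


def subject_is_for_server(subject: str) -> bool:
    """Return True if the subject contains one of the accepted spellings.

    Assumption: a simple substring test; the real server may match differently.
    """
    return any(word in subject for word in ACCEPTED_SUBJECT_WORDS)
```

For example, a subject of "dFLASH" passes this check, whereas "Dflash query"
does not, because that capitalisation is not in the accepted set.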

$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$


MESSAGE FORMAT: The typical message-body of an email request looks as follows
---------------

     BLOSUM 62                                  (optional  | DIRECTIVE)
     VERBOSE  10 20                             (optional  | DIRECTIVE)
     SEQUENCES  100                             (optional  | DIRECTIVE)
     ALIGNMENTS 50                              (optional  | DIRECTIVE)
     THRESHOLD  30                              (optional  | DIRECTIVE)
     KEY XMATCH					(optional  | DIRECTIVE)
     SOURCE PROTEIN				(optional  | DIRECTIVE)
     TARGET SP                                  (optional  | DIRECTIVE)
     BEGIN                                      (mandatory | DIRECTIVE)
     >A_ONE_LINE_TEST_SEQ_LABEL                 (mandatory -- notice the '>' )
     a_sequence_of_{amino_acids,nucleic_acids,spaces,tabs}
     1                                          (mandatory terminator)

    The PAM/BLOSUM, VERBOSE, SEQUENCES, ALIGNMENTS, THRESHOLD, KEY, SOURCE and
TARGET directives can appear in any order but they *must* precede the BEGIN
directive. The BEGIN line must be followed by the LABEL line which in turn
should be followed by the test sequence.
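
The ordering rules above can be sketched as a small Python helper that
assembles a message body.  This is a minimal sketch, not part of the server:
the function name and the dictionary-based interface are hypothetical, and no
validation of directive names or values is attempted.

```python
def build_request_body(label, sequence, directives=None):
    """Assemble a request body: directives, BEGIN, label, sequence, terminator.

    `directives` maps a directive name to its argument string, e.g.
    {"BLOSUM": 62, "TARGET": "SP"}.  All directives are written before BEGIN,
    as the help text requires; their relative order is not significant.
    """
    if "\n" in label or "\r" in label:
        raise ValueError("the label must fit on a single line")
    lines = [f"{name} {value}" for name, value in (directives or {}).items()]
    lines.append("BEGIN")
    lines.append(">" + label)   # the label line must start with '>'
    lines.append(sequence)
    lines.append("1")           # mandatory terminator
    return "\n".join(lines) + "\n"
```

Calling build_request_body("TEST", "VLSPADKTNVKAAWGKVG", {"BLOSUM": 62})
yields the lines "BLOSUM 62", "BEGIN", ">TEST", the sequence, and "1".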

    The test sequence should contain at least 18 (protein) or 54 (DNA) and not
more than 1500 amino acid or nucleotide characters.  It may, however, contain
ANY NUMBER of CARRIAGE RETURN, TAB and SPACE characters; these are, of course,
not counted when computing the length of the test sequence.  Neither the label
nor the test sequence is case sensitive.  If the test sequence is longer than
1500 characters, the e-mail filter will truncate it to the first 1500
characters, send a note to that effect to the originator of the query, and
then submit the truncated sequence to the search engine.
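
The length rules can be illustrated with the following Python sketch, which a
user might run before mailing a query.  The names are hypothetical, and
rejecting a too-short sequence with an exception is an assumption: the help
text does not say how the filter reacts in that case.

```python
MAX_LENGTH = 1500
MIN_LENGTH = {"protein": 18, "dna": 54}


def normalize_sequence(raw, source="protein"):
    """Strip whitespace, check the minimum length, truncate to 1500 characters.

    Returns (sequence, was_truncated).  Raising ValueError for a sequence
    that is too short is an assumption made for this sketch.
    """
    # str.split() with no argument splits on spaces, tabs and line breaks,
    # so none of these count towards the sequence length.
    cleaned = "".join(raw.split())
    if len(cleaned) < MIN_LENGTH[source]:
        raise ValueError(
            f"need at least {MIN_LENGTH[source]} characters, got {len(cleaned)}"
        )
    return cleaned[:MAX_LENGTH], len(cleaned) > MAX_LENGTH
```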

NOTA BENE:  The words appearing on the lines marked DIRECTIVE above can be in 
----------  lower case or upper case; in other words, you can have pam or PAM, 
	    threshold or THRESHOLD, alignments or ALIGNMENTS, etc.  However,
	    something like ThReShOlD will not work.

    The directive pertaining to the scoring matrix allows the user to specify
the matrix to be used for computing the alignment scores.  You can use either
the word PAM followed by a space and the desired distance, or the word BLOSUM
followed by space and the desired distance.  Examples:  PAM 250, BLOSUM 62 etc.
If no matrix directive is included in the message, PAM 250 is used as the
default.  Depending on the values of the directive TARGET (see below) the
matrix directive if present may be ignored.

    The VERBOSE line allows the sender to also retrieve the data about authors,
dates, entries, superfamilies etc. that are contained in the original PIR,
SwissProt and GenBank databases.  This directive accepts one OR two arguments;
for example:
                verbose         15      25
means "send me the text data for the sequences occupying positions 15 through 25
in the final ranking."  On the other hand,
                verbose         15
means "send me the text data for the sequences occupying the first 15 positions
in the final ranking."  If no verbose line appears, no citation data is sent.
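
The two forms of the VERBOSE directive can be summarised by the following
Python sketch.  The function name and the tuple return value are hypothetical
and serve only to make the one-argument and two-argument cases explicit.

```python
def verbose_range(args):
    """Map VERBOSE arguments to an inclusive (first, last) range of ranks.

    []        -> None      (no citation data is sent)
    [n]       -> (1, n)    (the first n positions)
    [a, b]    -> (a, b)    (positions a through b)
    """
    if len(args) == 0:
        return None
    if len(args) == 1:
        return (1, args[0])
    if len(args) == 2:
        return (args[0], args[1])
    raise ValueError("VERBOSE accepts one or two arguments")
```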

    The SEQUENCES line allows one to restrict the reported sequences to the
given number.  This directive controls the number of entries in the ``short
list'' of recovered database sequences only.  If no SEQUENCES line is given,
the server code will set it to an appropriate default value (100).

    The ALIGNMENTS line allows one to restrict the reported alignments to the
given number.  If no ALIGNMENTS line is given, the server code will set it to
an appropriate default value (100).  The ALIGNMENTS value cannot exceed 5000.
Values larger than 5000 are reduced to 5000.

    The THRESHOLD line allows one to restrict the number of reported sequences
(and thus alignments) to only those whose Score exceeds the given THRESHOLD
value.  If no THRESHOLD line is given the server code will set it to an
appropriate default value.  The default values are 50 for DNA sequences, and 80
for protein sequences.   There is also a *hard* threshold value of 40 for DNA,
and 30 for PROTEIN sequences;  if the user-requested values are smaller than
these hard-thresholds, the requested threshold will be increased accordingly.
NOTA BENE:  (1) if the THRESHOLD value is too small, you are running the danger
----------  of upsetting your mailer program since chances are that you will
            receive a very big file as a reply from the server.  
	    (2) if the THRESHOLD is too high the list of recovered entries 
	    will be empty, or very short; you should decrease the threshold's
	    value and resubmit your query.
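
The interplay between the default and the hard thresholds can be sketched in
Python as follows.  The function and constant names are hypothetical; only the
numeric values come from the text above.

```python
DEFAULT_THRESHOLD = {"dna": 50, "protein": 80}
HARD_THRESHOLD = {"dna": 40, "protein": 30}


def effective_threshold(source, requested=None):
    """Return the threshold that would apply to a query of the given type.

    With no THRESHOLD line the default applies; a requested value below the
    hard threshold is raised to the hard threshold.
    """
    if requested is None:
        return DEFAULT_THRESHOLD[source]
    return max(requested, HARD_THRESHOLD[source])
```

Under this reading, a protein query with "threshold 10" is searched with a
threshold of 30, while a DNA query with no THRESHOLD line uses 50.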

    The KEY line allows the user to specify the key to be used when sorting the
results (retrieved sequences) corresponding to a submitted search request.  The
keyword KEY can be followed by one of { SCORE,score,   LENGTH,length,  PEAK,
peak,  GAP,gap,  MATCH,match,  XMATCH,xmatch }.   By setting KEY to one of
{SCORE,score} the user indicates that the retrieved sequences should be sorted
in decreasing order of total computed score.  By setting KEY to one of {LENGTH,
length} the user indicates that the retrieved sequences be sorted in decreasing
order of their length.  Setting KEY to one of {PEAK,peak} will  result in the
retrieved sequences being sorted in decreasing order of the maximum score value
over *any* 18(=proteins) or 54(=dna) residue window of the recovered match.
Setting KEY to one of {GAP,gap} will  result in the retrieved sequences being
sorted in decreasing order of the maximum gap inserted that will result in a
best alignment with the query strand.  Setting KEY to one of {MATCH,match} will
result in the retrieved sequences being sorted in decreasing order of the total
(=conservative+exact) number of matches with the query strand. Finally, setting
KEY to one of {XMATCH,xmatch} will sort the retrieved sequences in decreasing
order of the number of exact matches with the query strand.  If no KEY directive
is specified, the retrieved sequences will be sorted in order of decreasing
"score".

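The effect of the KEY directive can be illustrated with a short Python sketch.
The Hit record and its field names are hypothetical stand-ins for whatever the
server stores per retrieved sequence; the sketch only shows the ordering
(always decreasing) and the rule that a keyword must be all lower case or all
upper case.

```python
from dataclasses import dataclass

SORT_KEYS = ("score", "length", "peak", "gap", "match", "xmatch")


@dataclass
class Hit:
    """Hypothetical per-sequence record; one field per documented sort key."""
    label: str
    score: int
    length: int
    peak: int
    gap: int
    match: int
    xmatch: int


def sort_hits(hits, key="score"):
    """Sort hits in decreasing order of the field named by KEY."""
    # Only all-lower-case or all-upper-case spellings are accepted.
    if key not in (key.lower(), key.upper()) or key.lower() not in SORT_KEYS:
        raise ValueError(f"invalid KEY: {key!r}")
    return sorted(hits, key=lambda h: getattr(h, key.lower()), reverse=True)
```
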
    The SOURCE line allows the user to specify the type of the query strand as
being a { PROTEIN,protein,    DNA,dna } sequence.  By setting  SOURCE to one of
{PROTEIN,protein} the user indicates that the query strand is a sequence of
amino acids.  By setting SOURCE to one of {DNA,dna}  the user indicates that the
query strand is a sequence of nucleotides. 

    The TARGET line allows the user to specify the type of the target database
to be one of { PIR,pir,   SP,sp,   GB,gb }.  This way the user controls the
database in which the search will be carried out.  If TARGET is set to one of
{PIR,pir}, the search will take place in the PIR database. If TARGET is set to
one of {SP,sp} the search will take place in the SWISSPROT database.   If
TARGET is set to one of {GB,gb},  the search will take place in the GenBank
database. Requests for searches in unsupported databases will be *IGNORED* by
the server and generate a complaint message that will be sent back to the
originator of the request.

If *only* SOURCE is specified, then the TARGET will be set automatically: in
particular, if SOURCE is set to one of { protein, PROTEIN } then the search 
will be carried out in the "PIR" database, whereas if SOURCE is set to one
of { dna, DNA } then the search will take place in the "GB" database. If
*neither* SOURCE *nor* TARGET lines are given, the server will assume it is
dealing with an amino acid strand and carry out the search against the "PIR"
database.
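
The defaulting rules for SOURCE and TARGET can be sketched in Python as
follows.  The function name is hypothetical.  The case where only TARGET is
given is not spelled out in the text, so the sketch deliberately leaves the
source undetermined (None) there rather than guessing.

```python
SUPPORTED_TARGETS = ("pir", "sp", "gb")
DEFAULT_TARGET_FOR_SOURCE = {"protein": "pir", "dna": "gb"}


def resolve_source_target(source=None, target=None):
    """Apply the documented defaults and return (source, target) in lower case.

    Note: lower() is used for brevity; mixed-case spellings, which the server
    rejects, are not checked in this sketch.
    """
    src = source.lower() if source is not None else None
    tgt = target.lower() if target is not None else None
    if src is not None and src not in DEFAULT_TARGET_FOR_SOURCE:
        raise ValueError(f"unknown SOURCE: {source!r}")
    if tgt is not None and tgt not in SUPPORTED_TARGETS:
        raise ValueError(f"unsupported TARGET: {target!r}")
    if src is None and tgt is None:
        return "protein", "pir"                      # neither line given
    if tgt is None:
        return src, DEFAULT_TARGET_FOR_SOURCE[src]   # only SOURCE given
    return src, tgt   # src stays None if only TARGET was given (undocumented)
```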

    The LABEL line allows the user to enter mnemonic information pertaining to
the test sequence, the time of the day etc.  The information on this line will
be reproduced in the Subject line of the reply message.   Notice that the
LABEL line *must* begin with the character '>'.

    All the submitted messages must be terminated by the number '1'.  This
number can follow the last character of the test sequence or be on a line by
itself.


$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$

A 'SMART' FILTER:   The email filter that allows for the above message format 
-----------------   has been improved in this release.  In particular, the 
filter is 'smart' enough to catch inconsistencies in the user's message. The 
filter will correct them and send a note to the originator of the message. 
*Unlike* older releases of the filter, this version will submit the corrected
message to the search engine.  The filter will also send one email note to the
originator of the query for *every* change it has carried out; the note(s)
will contain information about the actions that the filter has taken.

For example, if the user's note contains the following lines

	sequences 20
	alignments 50 
	verbose 10 30

the filter will reset the value of 'alignments' to 20 and the upper bound of
'verbose' ('verbose_to') to 20, and subsequently submit the corrected query to
the search engine.  Since two changes took place, the filter will also send two
email notes to the originator of the query detailing the actions it has taken.
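
The corrections in this example can be reproduced with the following Python
sketch.  It is an illustration of the described behaviour only: the function
name, the wording of the notes, and the assumption that SEQUENCES acts as the
upper limit for both other values are inferred from the example, not taken
from the filter's code.

```python
def reconcile(sequences, alignments, verbose_to=None):
    """Clamp ALIGNMENTS and the upper VERBOSE bound to SEQUENCES.

    Returns (sequences, alignments, verbose_to, notes), with one note per
    change, mirroring the one-email-per-change behaviour described above.
    """
    notes = []
    if alignments > sequences:
        notes.append(f"alignments reduced from {alignments} to {sequences}")
        alignments = sequences
    if verbose_to is not None and verbose_to > sequences:
        notes.append(f"verbose_to reduced from {verbose_to} to {sequences}")
        verbose_to = sequences
    return sequences, alignments, verbose_to, notes
```

With sequences=20, alignments=50 and verbose_to=30, the sketch returns 20 for
all three values together with two notes.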

$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$


EXAMPLES:  Two example inputs follow
---------

Example 1: 
                pam 250
                sequences 50
                alignments 30
                threshold  100
		target pir
                begin
                > HBA_HUMAN STANDARD; PRT; 141 AA. P01922; HEMOGLOBIN ALPHA 
                VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFP
                TTKTYFPHFDLSHGSAQVKGHG     KKVADALTNA
                V A H V D D M PNALSALSDLHAHKLRVDPVNFK
                llshcllvtlaahlpaeftpavhasldkflasvstvltskyr
                1

            Note:  all amino acids  from "VLSP" through "ltskyr"  will be used 
            in the search.  Not more than the 50 top scoring sequences will be
            reported in the short list.  Also, the alignments for the top 30
            scoring sequences will be returned.  No reported sequence will have
            a score that is less than 100, and the reported sequences will be
            sorted in order of decreasing score.  The test sequence is treated
            as a sequence of amino acids and will be searched against the PIR
            database, as requested by the TARGET line.

Example 2:
                BLOSUM 62
                KEY  XMATCH
	        BEGIN
                > Sequence sent to dflash on Fri May 20 13:40:17 EDT 1994
                VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFP
                TTKTYFPHFDLSHGSAQVKGHG     KKVADALTNA

                V A H V D D M PNALSALSDLHAHKLRVDPVNFK

                llshcllvtlaahlpaeftpavhasldkflasvstvltskyr
                1

            Note:  all amino acids  from "VLSP" through "ltskyr"  will be used 
            in the search.  The server code will set the various parameters to
            appropriate default values.  The server will treat the test sequence
	    as a sequence of amino acids (default) and will search against the 
	    "PIR" database (default) with a score threshold set at 80 (default).
	    The retrieved sequences will be reported in order of decreasing
	    number of exact matches with the query strand.

$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$

SCORING MATRICES:
-----------------

    You can use both PAM and BLOSUM scoring matrices for protein searches. These
can be requested via the optional { pam, PAM, blosum, BLOSUM } directive. The
currently supported distances are

for BLOSUM:  30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 100

for PAM:     10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150,
             160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280,
             290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410,
             420, 430, 440, 450, 460, 470, 480, 490, and 500.

For DNA searches, the PAM/BLOSUM declarations are ignored.
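
A user-side check of the matrix directive against the lists above might look
like the following Python sketch.  The function and constant names are
hypothetical; the distances are copied from the lists in this section.

```python
BLOSUM_DISTANCES = {30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 100}
PAM_DISTANCES = set(range(10, 501, 10))   # 10, 20, ..., 500


def matrix_is_supported(family, distance):
    """Return True if e.g. ("PAM", 250) or ("blosum", 62) is a listed matrix."""
    if family in ("pam", "PAM"):
        return distance in PAM_DISTANCES
    if family in ("blosum", "BLOSUM"):
        return distance in BLOSUM_DISTANCES
    return False
```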


$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$


NOTE ON ALIGNMENT:
------------------

    The server's alignment code implements the Smith-Waterman algorithm (dynamic
programming) to align each of the retrieved sequences with the test input. This
is *NOT* to be confused with the indexing method that we use to determine the
candidates to be aligned.
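
For readers unfamiliar with the Smith-Waterman recurrence, a minimal Python
sketch of the local-alignment score follows.  It is not the server's code: the
match, mismatch and gap values are placeholder assumptions (the server scores
with the selected PAM or BLOSUM matrix, and its gap penalties are not stated
here), and only the best score is computed, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local-alignment score of strings a and b (linear gap penalty).

    The scoring parameters are illustrative placeholders, not dFLASH values.
    """
    prev = [0] * (len(b) + 1)        # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]                   # column 0 is always 0
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                0,                   # start a new local alignment here
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Keeping only two rows of the table is enough when just the score is needed;
recovering the alignment itself would require the full table or a traceback.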

    The meaning of the variables in the listing that is returned by the dFLASH
server 

   .....
   ....

   Score Matrix: PAM250
   Max Reported Sequences:  1000
   Max Reported Alignments: 10
   Score Threshold  At: 65

    Id  Label:                                   Score  NRes  Ex% Tot% Sig  Pk
   ----------------------------------------------------------------------------
     1. HAHU hemoglobin alpha chain - human        655   141 100% 100% 100  89
     2. HACZ hemoglobin alpha chain - chimpanzee   655   141 100% 100% 100  89
     3. HACZP hemoglobin alpha chain - pygmy chi   655   141 100% 100% 100  89
     4. HAGO hemoglobin alpha chain - lowland go   654   141  99% 100%  99  89
     5. HAMQP hemoglobin alpha chain - hanuman l   653   141  97% 100%  99  89
     6. B27792 hemoglobin alpha-1 chain - orangu   649   141  97% 100%  99  89
     7. A25126 hemoglobin alpha-1 chain - Sumatr   649   141  97% 100%  99  89
    ...
    .....
    ..

is the following:

NRes:  the number of residues (amino acids) in the recovered match
Score: sequence  similarity score of the recovered sequence based on the
       selected mutation matrix
Ex%:   percentage of *exact* matching residues
Tot%:  percentage of *total* (=exact+conservative) matching residues
Sig:   100 times the ratio between the actual computed score and the score
       obtained by matching the retrieved sub-segment with itself; the
       denominator is the maximum obtainable score for the sub-segment in
       question (all gaps removed).
Pk:    the maximum (peak) score value over *any* 18(=proteins) or 54(=dna)
       residue window of the recovered match.
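
The definitions of Sig and of the peak score can be made concrete with the
following Python sketch.  Both functions are hypothetical illustrations:
truncating Sig to an integer and treating the peak as a sliding-window sum
over per-residue scores are assumptions, since the text does not state how the
server rounds or how gaps are handled inside a window.

```python
def significance(score, self_score):
    """100 times score / self_score, truncated to an integer (assumption)."""
    return 100 * score // self_score


def peak_score(per_residue_scores, window=18):
    """Maximum sum over any `window` consecutive per-residue scores.

    Use window=18 for protein and window=54 for DNA.  If the match is shorter
    than the window, the whole match is used (assumption).
    """
    if not per_residue_scores:
        return 0
    w = min(window, len(per_residue_scores))
    current = sum(per_residue_scores[:w])
    best = current
    for i in range(w, len(per_residue_scores)):
        # Slide the window one residue to the right.
        current += per_residue_scores[i] - per_residue_scores[i - w]
        best = max(best, current)
    return best
```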


$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$


TO OBTAIN HELP:
---------------
    You can obtain this message at any moment by sending a message with one of:
{ dflash, dFlash, dFLASH, DFLASH } in the "Subject" line and a body containing
one of { help, HELP, send help, SEND HELP }.
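
A help request of this form could be composed with Python's standard email
package, as in the sketch below.  The function name is hypothetical and the
sender address in the usage note is a placeholder; actually delivering the
message (for example via smtplib) is left out.

```python
from email.message import EmailMessage


def make_help_request(sender):
    """Build a message whose subject and body follow the rules above."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = "dflash@watson.ibm.com"
    msg["Subject"] = "dFLASH"          # one of the accepted subject spellings
    msg.set_content("send help\n")     # one of the accepted body forms
    return msg
```

For instance, make_help_request("user@example.org") returns a message with the
subject "dFLASH" and the single body line "send help".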


$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$


TO OBTAIN ON-LINE REPRINTS OF PAPERS
------------------------------------
    You can obtain reprints (in PostScript) of relevant papers by sending a
message with one of: { dflash, dFlash, dFLASH, DFLASH } in the "Subject" line
and a body containing 

one of {flashpaper, FLASHPAPER, send flashpaper, SEND FLASHPAPER }        
                                        ---> returns to the originator of the 
                                        request a copy of the FLASH paper
					that will appear in `CABIOS'

one of {dflashpaper, DFLASHPAPER, send dflashpaper, SEND DFLASHPAPER }        
                                        ---> returns to the originator of the 
                                        request a copy of a paper that contains
                                        a description of dFLASH that has 
					appeared in `IEEE Computational Science
					and Engineering'

one of {concertpaper, CONCERTPAPER, send concertpaper, SEND CONCERTPAPER } 
                                        ---> returns to the originator of the 
                                        request a copy of a high-level paper
                                        describing the CONCERT/C language

one of {bayespaper, BAYESPAPER, send bayespaper, SEND BAYESPAPER } 
                                        ---> returns to the originator of the 
                                        request a copy of a paper describing 
                                        a computer-vision application based 
                                        on indexing principles similar to
                                        those of dFLASH; the paper will
                                        appear in `CVGIP-IU'

    Notice there can only be *one* such request per message! Also, make sure
you do not issue a new paper request until after the previous request has
returned to you all of the postscript files and you have removed the latter
from your mailbox:  the returned messages are rather big (between 1 and 4
Megabytes) and are guaranteed to overflow the disk set aside for mail messages
on most systems.


$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$


Thank you for your interest in the dFLASH server. 

                                        Sincerely,

                                        The dFLASH Group


$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$

COMMENTS??  We will appreciate receiving your feedback, suggestions, comments, 
----------  or bug reports; all of these can be sent to "dflash@watson.ibm.com".
	    Please, make sure your  "Subject" line contains the word "comments"
	    or "bug".

$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$

REFERENCES  If you make use of the dFLASH server, please reference 
----------

     A. Califano and I. Rigoutsos, "FLASH: A Fast Look-up Algorithm for String
     Homology."  In  CABIOS.  To appear.

     I. Rigoutsos and A. Califano, "Searching In Parallel for Similar Protein
     Strings."  In IEEE Computational Science and Engineering, June 1994.

If you wish to find out more, you can contact Isidore Rigoutsos and Andrea
Califano at {rigoutso,acal}@watson.ibm.com


$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$


For more information on the Concert/C language, please refer to

     J. Auerbach, D. Bacon, A. Goldberg, G. Goldszmidt, A. Gopal, M. Kennedy,
     A. Lowry, J. Russell, W. Silverman, R. Strom, D. Yellin, and S. Yemini,
     "High-level language support  for programming reliable distributed
     systems."  In Proceedings of the International Conference on Computer
     Languages, April 1992, Oakland, California.

or contact Jim Russell (jrussell@watson.ibm.com).

$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$


------------------------------>  CUT HERE <-----------------------------------