Announcing GENBANK and PIR search engine (The dFLASH server)

The dFLASH Project dflash at watson.ibm.com
Mon Jun 27 23:01:21 EST 1994

The dFLASH Group announces the release of a new electronic mail server that
allows GENBANK and PIR similarity searches with the FLASH algorithm.  The 
server is accessible through the internet and is now operating 24 hours a
day, 7 days a week.  Appended, you will find the new server's help file.


The dFLASH Group

------------------------------>  CUT HERE <-----------------------------------

!!  ====================    MESSAGE OF THE DAY   ==========================  !!
!!                                                                           !!
!!  June     14, 1994:   We are now supporting both GENBANK  &  PIR searches !!
!!                       Releases:  PIR 40      (no incremental updates)     !!
!!                              &   GENBANK 82  (no incremental updates)     !!
!!  June     12, 1994:   The VERBOSE directive now works correctly.          !!
!!  November 16, 1993:   Beginning today, *NO* registration is required      !!
!!                       in order to use the dflash server.                  !!
!!                                                                           !!
!!  ====================    MESSAGE OF THE DAY   ==========================  !!

!!                                                                           !!
!!                         N O T A     B E N E                               !!
!! The dFLASH server is still under development.  If some of the answers do  !!
!! not make sense it is very likely that this is due to a bug in our code.   !!
!! Please help us improve this service by reporting such bugs:  you can      !!
!! email bug reports and comments to dflash at watson.ibm.com with subject   !!
!! line "bug" or "comments".                                                 !!
!!                                                                           !!

    Dear User, welcome to the dFLASH server!

    The dFLASH server is a "homologous sequence retrieval" program for PROTEIN
and DNA sequences.  We currently support the following databases:

    PIR     -- Release 40 with *no* incremental updates. 
       For more information, contact:  Protein Information Resource (PIR),
       National Biomedical Research Foundation, 3900 Reservoir Road, N.W.,
       Washington, DC  20007, USA

    GenBank -- Release 82 with *no* incremental updates
       For more information, contact:  National Center for Biotechnology
       Information, National Library of Medicine, 38A -- 8N805, 8600 Rockville
       Pike, Bethesda, MD  20894,  USA

    dFLASH is a parallel system running on a prototype 16-node IBM SP/1. 
Intra-node communication, evidence integration and alignment are performed in
parallel and use a prototype interconnection network.  The currently observed
performance is estimated to be between 2 and 5 times slower than that of a
production line SP/1.  The system has been implemented using IBM's Concert/C
language for distributed programming. The server is now available 24 hours a
day, 7 days a week.  Backups are performed daily between 01:00 and 01:30 Eastern
Standard Time, so performance is likely to be affected during this period.  At
the moment we automatically alternate service between the GenBank and PIR 
servers, so some lag may be observed; this will be alleviated in
future releases of the server code.

    Incremental changes and improvements made to the server will be reflected
in the "Message of the day" at the beginning of this help file:  we recommend
that users periodically issue a `send help' request for up-to-date information
on the server.

    For the moment, we can process requests originating from email addresses of 
the form 

We plan to further expand the accepted formats, depending on demand.


HOW TO USE THE SERVER: You can use the dFLASH facilities by sending an email 
---------------------- message with the appropriate syntax to the address 
                       "dflash at watson.ibm.com" (without the quotes).

SUBJECT LINE: It is important that the "Subject" line of your message contain 
------------- one of: { dflash, dFlash, dFLASH, DFLASH }.  Messages whose
              subject line does NOT conform to this rule **WILL BE LEFT
              UNPROCESSED**.  The reason for this restriction is that we want
              to be able to automatically distinguish between messages that are
              addressed to the server and those that are meant for one of the
              group members.
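
    A client can check its own subject line before sending.  The Python sketch
below assumes a plain substring test; the help file lists the accepted
spellings but not the server's exact matching rule, and the function name is
hypothetical.

```python
# Spellings listed as accepted in the SUBJECT LINE section.
ACCEPTED_SUBJECT_WORDS = ("dflash", "dFlash", "dFLASH", "DFLASH")


def subject_is_accepted(subject: str) -> bool:
    """Return True if the subject contains one of the accepted spellings.

    Assumption: a substring test. The server's real rule is not documented.
    """
    return any(word in subject for word in ACCEPTED_SUBJECT_WORDS)
```

    Under this assumption a subject such as "Dflash query" would be rejected,
because none of the four listed spellings occurs in it.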


MESSAGE FORMAT: The typical message-body of an email request looks as follows

     BLOSUM 250                                 (optional  | DIRECTIVE)
     VERBOSE  10 20                             (optional  | DIRECTIVE)
     SEQUENCES  100                             (optional  | DIRECTIVE)
     ALIGNMENTS 50                              (optional  | DIRECTIVE)
     THRESHOLD  30                              (optional  | DIRECTIVE)
     TARGET PROTEIN                             (optional  | DIRECTIVE)
     BEGIN                                      (mandatory | DIRECTIVE)
     >A_ONE_LINE_TEST_SEQ_LABEL                 (mandatory -- notice the '>' )
     1                                          (mandatory terminator)

Directives can appear in any order but they *must* precede the BEGIN directive. 
The BEGIN line must be followed by the LABEL line, which in turn must be
followed by the test sequence.
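
    The ordering rules can be applied mechanically when composing a request.
A minimal Python sketch follows; build_request and its parameters are
hypothetical names, not part of the server, and the sequence is passed through
unchanged.

```python
def build_request(label, sequence, directives=None):
    """Assemble a message body: directives, BEGIN, '>' label, sequence, '1'.

    `directives` is a mapping such as {"TARGET": "DNA", "THRESHOLD": 30}.
    """
    if "\n" in label or "\r" in label:
        raise ValueError("the label must fit on one line")
    lines = []
    # Directives may come in any order, but all of them precede BEGIN.
    for name, value in (directives or {}).items():
        lines.append(f"{name} {value}")
    lines.append("BEGIN")
    lines.append(">" + label)  # the label line must start with '>'
    lines.append(sequence)
    lines.append("1")          # mandatory terminator
    return "\n".join(lines) + "\n"
```

    For example, build_request("TEST", seq, {"TARGET": "DNA"}) yields a body
whose first line is "TARGET DNA" and whose last line is "1".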

    The test sequence should contain at least 30 and not more than 3,000 amino
acid or nucleotide characters.  It may, however, contain ANY NUMBER of CARRIAGE
RETURN, TAB and SPACE characters; these are of course not counted when computing
the length of the test sequence. There is NO case sensitivity in the label and
the test sequence itself.
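
    The length rule can be checked locally before a query is mailed.  The
Python sketch below uses hypothetical function names and assumes that line
feeds are ignored in the same way as carriage returns, tabs and spaces.

```python
# Characters that do not count towards the sequence length.
# Assumption: line feeds are skipped like carriage returns.
IGNORED = " \t\r\n"


def sequence_length(raw: str) -> int:
    """Count the characters of a test sequence, skipping layout whitespace."""
    return sum(1 for ch in raw if ch not in IGNORED)


def length_is_acceptable(raw: str) -> bool:
    """True if the sequence has between 30 and 3,000 counted characters."""
    return 30 <= sequence_length(raw) <= 3000
```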

    The words appearing on the lines marked DIRECTIVE above can be in lower
case or upper case; in other words, you can have pam or PAM, threshold or
THRESHOLD, alignments or ALIGNMENTS, etc.  However, something like ThReShOlD
will not work.
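
    The case rule for directive words can be illustrated with a short Python
sketch.  The function name is hypothetical, the list of directive names is
taken from the message format shown earlier, and the server's actual parser may
differ.

```python
KNOWN_DIRECTIVES = {
    "PAM", "BLOSUM", "VERBOSE", "SEQUENCES",
    "ALIGNMENTS", "THRESHOLD", "TARGET", "BEGIN",
}


def parse_directive(line: str):
    """Return (NAME, [arguments]) for a directive line, or None if rejected.

    A directive word is accepted only if it is entirely lower case or
    entirely upper case; mixed case such as ThReShOlD is rejected.
    """
    parts = line.split()
    if not parts:
        return None
    word = parts[0]
    if not (word.islower() or word.isupper()):
        return None
    name = word.upper()
    if name not in KNOWN_DIRECTIVES:
        return None
    return name, parts[1:]
```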

    The directive pertaining to the scoring matrix allows the user to specify
the matrix to be used for computing the alignment scores.  You can use either
the word PAM followed by a space and the desired distance, or the word BLOSUM
followed by a space and the desired distance.  Examples:  PAM 250, BLOSUM 62, etc.
If no matrix directive is included in the message, PAM 250 is used as the
default.  Depending on the value of the TARGET directive (see below), the
matrix directive, if present, may be ignored.

    The VERBOSE line allows the sender to also retrieve the data about authors,
dates, entries, superfamilies etc. that are contained in the original PIR and
GenBank databases.  This directive accepts one OR two arguments; for example:
                verbose         15      25
means "send me the text data for the sequences occupying positions 15 through 25
in the final ranking."  On the other hand,
                verbose         15
means "send me the text data for the sequences occupying the first 15 positions
in the final ranking."  If no verbose line appears, no citation data is sent.
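
    The two forms of the VERBOSE directive map to ranking positions as in the
following Python sketch (a hypothetical helper, using 1-based positions as in
the text above).

```python
def verbose_positions(args):
    """Map VERBOSE arguments to the ranking positions they select.

    One argument N selects positions 1..N; two arguments A B select A..B.
    """
    if len(args) == 1:
        return range(1, args[0] + 1)
    if len(args) == 2:
        first, last = args
        return range(first, last + 1)
    raise ValueError("VERBOSE takes one or two arguments")
```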

    The SEQUENCES line allows one to restrict the reported sequences to the
given number.  This directive controls the number of entries in the ``short
list'' of recovered database sequences only.  If no SEQUENCES line is given,
the server code will set it to an appropriate default value.

    The ALIGNMENTS line allows one to restrict the reported alignments to the
given number.  If no ALIGNMENTS line is given, the server code will set it to
an appropriate default value.  The ALIGNMENTS value cannot exceed 1000.  Values
larger than 1000 are reduced to 1000.

    The THRESHOLD line allows one to restrict the number of reported sequences
(and thus alignments) to only those whose Score exceeds the given THRESHOLD
value.  If no THRESHOLD line is given the server code will set it to an
appropriate default value.  The default values are 50 for DNA sequences, and 80
for PROTEIN sequences.   There is also a *hard* threshold value of 20 for DNA,
and 30 for PROTEIN sequences;  if the user-requested values are smaller than
these hard-thresholds, the requested threshold will be increased accordingly.
NOTA BENE:  (1) if the THRESHOLD value is too small, you run the risk
---------   of upsetting your mailer program, since chances are that you will
            receive a very big file as a reply from the server.  
            (2) if the THRESHOLD is too high, the list of recovered entries 
            will be empty or very short; you should decrease the threshold's
            value and resubmit your query.
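
    The interplay between the default and hard threshold values can be
summarized in a few lines of Python.  This is a sketch of the rule as described
above, with hypothetical names, not the server's code.

```python
DEFAULT_THRESHOLD = {"DNA": 50, "PROTEIN": 80}
HARD_THRESHOLD = {"DNA": 20, "PROTEIN": 30}


def effective_threshold(target, requested=None):
    """Return the threshold used for a search against `target`.

    With no THRESHOLD line the default applies; a requested value below
    the hard threshold is raised to the hard threshold.
    """
    if requested is None:
        return DEFAULT_THRESHOLD[target]
    return max(requested, HARD_THRESHOLD[target])
```

    For instance, a request of THRESHOLD 10 with TARGET PROTEIN would be
carried out with a threshold of 30.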

    The TARGET line allows the user to specify the type of the target database,
as being one of { PROTEIN , DNA }.  This way the user controls the database in
which the search will be carried out.  If TARGET is set to PROTEIN, the search
will take place in the PIR database.  If TARGET is set to DNA, the search will
take place in the GenBank database.  We will soon allow users to 'mix and
match', i.e., users will be able to request that amino acid sequences
be searched against GenBank, nucleotide sequences against PIR, etc., by making use
of the appropriate directives. If no TARGET line is given, the server will
assume the default value PROTEIN and thus will search against the PIR database.

    The LABEL line allows the user to enter mnemonic information pertaining to
the test sequence, the time of day, etc.  The information on this line will
be reproduced in the Subject line of the reply message.   Notice that the
LABEL line *must* begin with the character '>'.

    All submitted messages must be terminated by the number '1'.  This
number can follow the last character of the test sequence or be on a line by
itself.


EXAMPLES:  Two example inputs follow

Example 1: 
                pam 250
                sequences 50
                alignments 30
                threshold  100
		target protein
                > HBA_HUMAN STANDARD; PRT; 141 AA. P01922; HEMOGLOBIN ALPHA 

            Note:  all amino acids  from "VLSP" through "ltskyr"  will be used 
            in the search.  Not more than the 50 top scoring sequences will be
            reported in the short list.  Also, the alignments for the top 30
            scoring sequences will be returned.  No reported sequence will have
            a score that is less than 100.  The test sequence is declared to be a
            sequence of amino acids and should be searched against the PIR
            database.

Example 2:
                BLOSUM 62
                > Sequence sent to dflash on Fri May 20 13:40:17 EDT 1994



            Note:  all amino acids  from "VLSP" through "ltskyr"  will be used 
            in the search.  The server code will set the various parameters to
            appropriate default values.  The server will treat the test sequence
	    as a sequence of amino acids (default) and will search against the 
	    PIR database (default) with a score threshold set at 80 (default).



    You can use both PAM and BLOSUM scoring matrices for protein searches. These
can be requested via the optional { pam, PAM, blosum, BLOSUM } directive. The
currently supported distances are

for BLOSUM:  30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 100

for PAM:     10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150,
             160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280,
             290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410,
             420, 430, 440, 450, 460, 470, 480, 490, and 500.

For DNA searches, the PAM/BLOSUM declarations are ignored.
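
    A matrix directive can be checked against the supported distances before a
query is sent.  The Python sketch below uses hypothetical names; the distance
sets are copied from the lists above.

```python
BLOSUM_DISTANCES = {30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 100}
PAM_DISTANCES = set(range(10, 501, 10))  # 10, 20, ..., 500


def matrix_is_supported(family: str, distance: int) -> bool:
    """True if e.g. ("PAM", 250) or ("blosum", 62) is a supported matrix."""
    family = family.upper()
    if family == "PAM":
        return distance in PAM_DISTANCES
    if family == "BLOSUM":
        return distance in BLOSUM_DISTANCES
    return False
```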



    The server's alignment code implements the Smith-Waterman algorithm (dynamic
programming) to align each of the retrieved sequences with the test input. This
is *NOT* to be confused with the indexing method that we use to determine the
candidates to be aligned.
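
    For readers unfamiliar with the Smith-Waterman recurrence, a minimal Python
sketch follows.  It computes only the best local alignment score, and it
assumes a flat match/mismatch score and a linear gap penalty.  These values are
illustrative assumptions:  the server scores residues with the selected PAM or
BLOSUM matrix, and its gap penalties are not stated in this help file.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (case-insensitive).

    Assumed scoring: flat match/mismatch values and a linear gap penalty.
    """
    a, b = a.upper(), b.upper()
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

    The zero in the recurrence is what makes the alignment local:  a poorly
scoring prefix is dropped rather than carried along.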

    The meaning of the variables in the listing that is returned by the dFLASH
server


    Score Matrix: PAM250
    Max Reported Sequences:  1000
    Max Reported Alignments: 10
    Score Threshold  At: 65

     Id  Label:                                   Score  NRes  Ex% Tot% Sig  Pk
      1. HAHU hemoglobin alpha chain - human        655   141 100% 100% 100  89
      2. HACZ hemoglobin alpha chain - chimpanzee   655   141 100% 100% 100  89
      3. HACZP hemoglobin alpha chain - pygmy chi   655   141 100% 100% 100  89
      4. HAGO hemoglobin alpha chain - lowland go   654   141  99% 100%  99  89
      5. HAMQP hemoglobin alpha chain - hanuman l   653   141  97% 100%  99  89
      6. B27792 hemoglobin alpha-1 chain - orangu   649   141  97% 100%  99  89
      7. A25126 hemoglobin alpha-1 chain - Sumatr   649   141  97% 100%  99  89

is the following:

NRes:  the number of residues (amino acids) in the recovered match
Score: sequence  similarity score of the recovered sequence based on the
       selected mutation matrix
Ex%:   percentage of *exact* matching residues
Tot%:  percentage of *total* (=exact+conservative) matching residues
Sig:   100 times the ratio between the actual computed score and the score
       obtained by matching the retrieved sub-segment with itself; the
       denominator is the maximum obtainable score for the sub-segment in
       question (all gaps removed).
Pk:    the maximum score value over *any* 20-residue window of the recovered
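
    The Sig and Pk columns can be illustrated with the Python sketch below.  It
is one reading of the definitions above, not the server's code:  the function
names are hypothetical, the per-position scores are assumed to be available as
a list, the result of sig is left unrounded, and a sequence shorter than the
window is assumed to be scored as a whole.

```python
def sig(score, self_score):
    """100 times the ratio of the computed score to the self-match score."""
    return 100.0 * score / self_score


def peak(position_scores, window=20):
    """Maximum summed score over any `window` consecutive positions."""
    if len(position_scores) <= window:
        return sum(position_scores)  # assumption for short sequences
    current = sum(position_scores[:window])
    best = current
    # Slide the window one position at a time, updating the running sum.
    for i in range(window, len(position_scores)):
        current += position_scores[i] - position_scores[i - window]
        best = max(best, current)
    return best
```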


    You can obtain this message at any moment by sending a message with one of:
{ dflash, dFlash, dFLASH, DFLASH } in the "Subject" line and a body containing
one of { help, HELP, send help, SEND HELP }.


    You can obtain reprints (in PostScript) of relevant papers by sending a
message with one of: { dflash, dFlash, dFLASH, DFLASH } in the "Subject" line
and a body containing 

one of {flashpaper, FLASHPAPER, send flashpaper, SEND FLASHPAPER }        
                                        ---> returns to the originator of the 
                                        request a copy of the FLASH paper

one of {dflashpaper, DFLASHPAPER, send dflashpaper, SEND DFLASHPAPER }        
                                        ---> returns to the originator of the 
                                        request a copy of a paper that contains
                                        a description of dFLASH (long)

one of {concertpaper, CONCERTPAPER, send concertpaper, SEND CONCERTPAPER } 
                                        ---> returns to the originator of the 
                                        request a copy of a high-level paper
                                        describing the CONCERT/C language

one of {bayespaper, BAYESPAPER, send bayespaper, SEND BAYESPAPER } 
                                        ---> returns to the originator of the 
                                        request a copy of a paper describing 
                                        a computer-vision application based 
                                        on indexing principles similar to
                                        those of dFLASH (long)

    Notice there can only be *one* such request per message! Also, make sure
you do not issue a new paper request until after the previous request has
returned to you all of the PostScript files and you have removed the latter
from your mailbox:  the returned messages are rather big (between 1 and 4
Megabytes) and are guaranteed to overflow the disk set aside for mail messages
on most systems.



(1) for the time being we do not incorporate incremental updates of PIR
(2) for the time being we do not incorporate incremental updates of GenBank
(3) dFLASH searches are currently available through GRAIL of the Oak Ridge
    National Laboratory.

Thank you for your interest in the dFLASH server. 


                                        The dFLASH Group


COMMENTS??  We would appreciate receiving your feedback, suggestions, comments, 
----------  or bug reports; all of these can be sent to "dflash at watson.ibm.com".
            Please make sure your "Subject" line contains the word "comments"
            or "bug".


REFERENCES  If you make use of the dFLASH server, please reference 

     A. Califano and I. Rigoutsos, "FLASH: A Fast Look-up Algorithm for String
     Homology."  In Proceedings of the First International Conference on
     Intelligent Systems for Molecular Biology, July 1993, Bethesda, MD.

     I. Rigoutsos and A. Califano, "Searching In Parallel for Similar Protein
     Strings."  In IEEE Computational Science and Engineering, June 1994.

If you wish to find out more, you can contact Isidore Rigoutsos and Andrea
Califano at {rigoutso,acal}@watson.ibm.com


For more information on the Concert/C language, please refer to

     J. Auerbach, D. Bacon, A. Goldberg, G. Goldszmidt, A. Gopal, M. Kennedy,
     A. Lowry, J. Russell, W. Silverman, R. Strom, D. Yellin, and S. Yemini,
     "High-level language support  for programming reliable distributed
     systems."  In Proceedings of the International Conference on Computer
     Languages, April 1992, Oakland, California.

or contact Jim Russell (jrussell at watson.ibm.com)


------------------------------>  CUT HERE <-----------------------------------
