a dFLASH server update

The dFLASH Project dflash at watson.ibm.com
Wed Dec 22 16:08:44 EST 1993



The dFLASH group announces the availability of a new release of the dFLASH
email server.  The new version features:

o  greatly improved sensitivity 
o  improved mail interface
o  improved syntax-error checking
o  ability to recover text data pertaining to the retrieved sequences (see the
   option VERBOSE below)
o  new, more comprehensive protein database (we now use PIR Rel. 38)
o  a new computational platform
o  much improved alignment results
o  new request handling approach:  all of the submitted requests will
   eventually be processed, independent of whether the server is running upon
   reception of the request;  users will not receive the "server UNavailable"
   messages anymore, and there will be no need for re-submission.


Also, registration is *no* longer required in order to use the dFLASH server.
All submitted requests will be honored, provided that the sender's email address
conforms to the accepted formats (see below).  The server's help file is
included at the end of this message, with details on the use of the system.

Finally, we look forward to receiving your feedback on the current version of
the server.  We will greatly appreciate your suggestions for modifications,
comments, criticism, etc., which should be sent to dflash at watson.ibm.com
(Subject: Comments).


Sincerely,


The dFLASH Group





---------------------------------- Cut Here  ----------------------------------


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!                             N O T A     B E N E                           !!
!! The dFLASH server is still under development.  If some of the answers do  !!
!! not make sense it is very likely that this is due to a bug in our code.   !!
!!                                                                           !!
!! Reporting of such bugs will help us to incorporate all the needed fixes.  !!
!!                                                                           !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!                                                                           !!
!! The database that we use is PIR Release 38 with *no* incremental updates. !!
!! For more information, contact:                                            !!
!!                                                                           !!
!!                    Protein Information Resource (PIR)                     !!
!!                  National Biomedical Research Foundation                  !!
!!                        3900 Reservoir Road, N.W.,                         !!
!!                        Washington, DC  20007, USA                         !!
!!                                                                           !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    Dear User, welcome to the dFLASH server!

    The dFLASH server is a "homologous sequence retrieval" program for protein
sequences (see also NOTES below).  dFLASH is a distributed system which runs
on a 16-node IBM SP/1.  Although the SP/1 has a fast interconnection network
for inter-node communication, dFLASH currently uses regular TCP/IP for message
delivery.  Furthermore, evidence integration and alignment are performed on a
single node, instead of in parallel on all 16 nodes.  As is evidenced by the
difference between the total CPU usage and the elapsed wall clock time, a large
portion of the total time is consumed by network communication and serial
processing.  We will soon exploit the SP/1's fast interconnection feature and
also parallelize the evidence integration/alignment code, which we expect to
yield a 16-fold speedup.  The system has been implemented using IBM's
Concert/C language for distributed programming.  The server is now available 24
hours a day, 7 days a week.  Meanwhile, incremental changes and improvements
made to the server will be reflected in the text of this help file:  it is
recommended that users periodically issue a `send help' request for up-to-date
information on the server.



    Effective today, November 16, 1993, *no* registration is required in order
to use the dFLASH server.  



For the moment, we can process requests originating from email addresses of the
form 
		"user@[machine.]institution.type"  
			or 
		"user%machine@[machine.]institution.type"  
We plan to further expand the accepted formats, depending on demand.
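
As a rough illustration only, the two accepted address forms can be checked
before a request is sent.  The sketch below is in Python and reflects our
reading of the patterns above (an optional "%machine" part, then a domain of
two or three dot-separated labels); the function name and the exact character
classes are assumptions, not the server's actual check.

```python
import re

# Assumed reading of the accepted forms:
#   user@[machine.]institution.type
#   user%machine@[machine.]institution.type
_LABEL = r"[A-Za-z0-9-]+"
_ADDRESS = re.compile(
    rf"^[^@%\s]+(?:%{_LABEL})?@(?:{_LABEL}\.)?{_LABEL}\.{_LABEL}$"
)


def looks_acceptable(address: str) -> bool:
    """Return True if the address matches one of the two accepted forms."""
    return _ADDRESS.match(address) is not None
```

For instance, looks_acceptable("user%machine@institution.edu") is accepted
under this reading, while an address with four domain labels is not.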

    You can use the dFLASH facilities by sending an email message to 

			"dflash at watson.ibm.com"

$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$

VERY IMPORTANT:    the "Subject" line of the message should be one of: { dflash,
---------------    dFlash, dFLASH, DFLASH }.  Messages whose subject line does
                   not conform to this rule will be left **unprocessed**.  The
                   reason for this restriction is that we want to be able to
                   automatically distinguish between messages that are
                   addressed to the server and those that are meant for one of
                   the group members.

$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
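
For users who generate their requests from a script, the subject rule can be
enforced while the message is composed.  The following Python sketch uses the
standard email.message module; the function name make_request and its
parameters are hypothetical, and the server address is passed in by the caller
rather than assumed here.

```python
from email.message import EmailMessage

# The four spellings listed above; anything else is left unprocessed.
ACCEPTED_SUBJECTS = {"dflash", "dFlash", "dFLASH", "DFLASH"}


def make_request(sender, server_address, body, subject="dFLASH"):
    """Compose a request message, refusing subjects the server would ignore."""
    if subject not in ACCEPTED_SUBJECTS:
        raise ValueError(f"subject {subject!r} would be left unprocessed")
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = server_address
    msg["Subject"] = subject
    msg.set_content(body)
    return msg
```

The resulting message can then be handed to whatever mail transport is
available locally.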


REQUEST FORMAT:
---------------
The typical message-body of an email request looks like:

     PAM   250  				(mandatory | DIRECTIVE)
     VERBOSE  10 20				(optional  | DIRECTIVE)
     SEQUENCES  100  				(optional  | DIRECTIVE)
     ALIGNMENTS 50  				(optional  | DIRECTIVE)
     THRESHOLD  30  				(optional  | DIRECTIVE)
     BEGIN 					(mandatory | DIRECTIVE)
     >A_ONE_LINE_TEST_SEQ_LABEL               	(mandatory -- notice the '>' )
     a_sequence_of_{amino_acids,spaces,tabs}
     1						(mandatory terminator)

The PAM/BLOSUM, VERBOSE, SEQUENCES, ALIGNMENTS, and THRESHOLD directives can
appear in any order but they *must* precede the BEGIN directive.  The BEGIN
line must precede the LABEL line, and the latter must precede the test sequence.
The test sequence should contain at least 30 and not more than 1,000 amino
acids; it *may* also contain CARRIAGE RETURNS, TABS and SPACES.  Neither the
label nor the test sequence is case sensitive.

The words appearing on the lines marked DIRECTIVE above can be in lower case or
upper case; in other words, you can have pam or PAM, threshold or THRESHOLD,
alignments or ALIGNMENTS, etc.  However, something like ThReShOlD will not work.
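
The ordering rules above can be captured in a small helper that assembles a
request body.  This is a minimal Python sketch, not part of the server: the
name build_request, its keyword arguments, and the choice to wrap the sequence
at 60 columns are all assumptions made for illustration.

```python
def build_request(label, sequence, matrix="PAM", distance=250, **options):
    """Return a request body: directives, BEGIN, '>' label, sequence, '1'."""
    if "\n" in label:
        raise ValueError("the label must fit on one line")
    # Spaces, tabs and carriage returns are allowed, so only residues count.
    residues = "".join(sequence.split())
    if not 30 <= len(residues) <= 1000:
        raise ValueError("the test sequence needs 30 to 1,000 amino acids")
    # Directives are emitted in upper case; mixed case would not work.
    lines = [f"{matrix.upper()} {distance}"]
    for name in ("verbose", "sequences", "alignments", "threshold"):
        if name in options:
            value = options[name]
            # VERBOSE may take two arguments; pass those as a tuple.
            args = value if isinstance(value, tuple) else (value,)
            lines.append(name.upper() + " " + " ".join(str(a) for a in args))
    lines.append("BEGIN")
    lines.append(">" + label)
    # Wrapping at 60 columns is our choice, not a server requirement.
    lines += [residues[i:i + 60] for i in range(0, len(residues), 60)]
    lines.append("1")
    return "\n".join(lines) + "\n"
```

For example, build_request("MY_LABEL", seq, sequences=50, verbose=(10, 20))
places the VERBOSE and SEQUENCES lines before BEGIN and ends the body with the
terminator '1'.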

The VERBOSE line allows the sender to also retrieve the data about authors,
dates, entries, superfamilies etc. that are contained in the original PIR 
database.  This directive can take one or two arguments; for example:
		verbose 	15 	25
means "send me the text data for the proteins occupying positions 15 through 25
in the final ranking."  On the other hand,
		verbose 	15
means "send me the text data for the proteins occupying the first 15 positions
in the final ranking."  If no verbose line appears, no text data will be sent.
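
To make the two forms of VERBOSE concrete, the Python sketch below maps the
arguments of a verbose line to the range of ranking positions it selects.  The
function name is hypothetical, and rejecting a reversed range is our
assumption; this help file does not say how the server treats that case.

```python
def verbose_positions(*args):
    """Return (first, last), the 1-based inclusive positions selected."""
    if len(args) == 1:
        # One argument: the first N positions of the final ranking.
        return (1, args[0])
    if len(args) == 2:
        first, last = args
        if first > last:
            # Assumption: a reversed range is treated as an error here.
            raise ValueError("first position exceeds last position")
        return (first, last)
    raise ValueError("VERBOSE takes one or two arguments")
```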


The SEQUENCES line allows one to restrict the reported sequences to the given
number.  This directive controls the number of entries in the ``short list''
of recovered database sequences only.  If no SEQUENCES line is given, the
server code will set it to an appropriate default value.


The ALIGNMENTS line allows one to restrict the reported alignments to the given
number.  If no ALIGNMENTS line is given, the server code will set it to an
appropriate default value.  The ALIGNMENTS value cannot exceed 1000.  Values
larger than 1000 are reduced to 1000.


The THRESHOLD line allows one to restrict the number of reported sequences (and
thus alignments) to only those whose Score exceeds the given THRESHOLD value. 
If no THRESHOLD line is given the server code will set it to an appropriate
default value.  The THRESHOLD value cannot be less than 30.  Values smaller
than 30 are increased to 30. Notice:  if the THRESHOLD value is too small, you
are running the danger of upsetting your mailer program since chances are that
you will receive a very big file as a reply from the server.
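
The limits on ALIGNMENTS and THRESHOLD amount to simple clamping.  A short
Python sketch of the adjustment described above follows; the function names are
ours, and the unspecified default values are deliberately not modelled.

```python
MAX_ALIGNMENTS = 1000   # values larger than 1000 are reduced to 1000
MIN_THRESHOLD = 30      # values smaller than 30 are increased to 30


def effective_alignments(requested):
    """Return the ALIGNMENTS value that would actually be used."""
    return min(requested, MAX_ALIGNMENTS)


def effective_threshold(requested):
    """Return the THRESHOLD value that would actually be used."""
    return max(requested, MIN_THRESHOLD)
```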


The LABEL line *must* now begin with the character '>'.


Finally, notice that you need to terminate the sequence with the terminator '1'.


Two example requests follow:

Example 1: 
		pam 250
		sequences 50
		alignments 30
		threshold  100
		begin
		> HBA_HUMAN STANDARD; PRT; 141 AA. P01922; HEMOGLOBIN ALPHA 
		VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFP
		TTKTYFPHFDLSHGSAQVKGHG     KKVADALTNA
		V A H V D D M PNALSALSDLHAHKLRVDPVNFK
		llshcllvtlaahlpaeftpavhasldkflasvstvltskyr
		1

            Note:  all amino acids from "VLSP" through "ltskyr" will be used
            in the search.  Not more than the 50 top scoring sequences will be
            reported in the short list.  Also, the alignments for the top 30
            scoring sequences will be returned.  No reported sequence will have
            a score that is less than 100.

Example 2:
		BLOSUM 62
		BEGIN
		>     Your-Favorite-Label Goes Here
		VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFP
		TTKTYFPHFDLSHGSAQVKGHG     KKVADALTNA

		V A H V D D M PNALSALSDLHAHKLRVDPVNFK

		llshcllvtlaahlpaeftpavhasldkflasvstvltskyr
		1

     	    Note:  all amino acids  from "VLSP" through "ltskyr"  will be used 
	    in the search.  The server code will set the various parameters to
	    appropriate default values.



SCORING MATRICES:
-----------------
You can use both PAM and BLOSUM scoring matrices. These can be requested via
one of { pam, PAM, blosum, BLOSUM }. The currently supported distances are

for BLOSUM:  30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 100

for PAM:     10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150,
	     160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280,
	     290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410,
	     420, 430, 440, 450, 460, 470, 480, 490, and 500.
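
A request can be checked against these lists before it is mailed.  The Python
sketch below encodes the two lists exactly as given above; the function name
is_supported is hypothetical.

```python
BLOSUM_DISTANCES = {30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 100}
PAM_DISTANCES = set(range(10, 501, 10))   # 10, 20, ..., 500

ACCEPTED_NAMES = {"pam", "PAM", "blosum", "BLOSUM"}


def is_supported(matrix, distance):
    """Return True if the matrix name and distance appear in the lists above."""
    if matrix not in ACCEPTED_NAMES:
        return False
    table = PAM_DISTANCES if matrix.upper() == "PAM" else BLOSUM_DISTANCES
    return distance in table
```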


NOTE ON ALIGNMENT:
------------------
The server's alignment code now implements dynamic programming.  This is not to
be confused with the indexing method that is used to determine the candidates
to align.

The meaning of the variables in the listing that is returned by the dFLASH
server

   .....
   ....

   Score Matrix: PAM250
   Max Reported Sequences:  1000
   Max Reported Alignments: 10
   Score Threshold  At: 65

     Id  Label:                                   Score  NRes  Ex% Tot% Sig  Pk
   ----------------------------------------------------------------------------
      1. HAHU hemoglobin alpha chain - human        655   141 100% 100% 100  89
      2. HACZ hemoglobin alpha chain - chimpanzee   655   141 100% 100% 100  89
      3. HACZP hemoglobin alpha chain - pygmy chi   655   141 100% 100% 100  89
      4. HAGO hemoglobin alpha chain - lowland go   654   141  99% 100%  99  89
      5. HAMQP hemoglobin alpha chain - hanuman l   653   141  97% 100%  99  89
      6. B27792 hemoglobin alpha-1 chain - orangu   649   141  97% 100%  99  89
      7. A25126 hemoglobin alpha-1 chain - Sumatr   649   141  97% 100%  99  89
    ...
    .....
    ..

is the following:

NRes:  the number of residues (amino acids) in the recovered match
Score: sequence  similarity score of the recovered sequence based on the
       selected mutation matrix
Ex%:   percentage of *exact* matching residues
Tot%:  percentage of *total* (=exact+conservative) matching residues
Sig:   100 times the ratio between the actual computed score and the score
       obtained by matching the retrieved sub-segment with itself; the
       denominator is the maximum obtainable score for the sub-segment in
       question (all gaps removed).
Pk:    the maximum score value over *any* 20-residue window of the recovered
       match
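
The Sig and Pk definitions can be illustrated with a few lines of Python.  This
is a sketch under stated assumptions, not the server's code: pair_score stands
for a lookup in the selected matrix, '-' is assumed to mark a gap,
position_scores is assumed to hold one score per aligned position, and the
handling of matches shorter than 20 residues is our guess.  Rounding to the
integers shown in the listing is not modelled.

```python
def sig(actual_score, segment, pair_score):
    """100 times the actual score over the segment's score against itself."""
    residues = segment.replace("-", "")          # all gaps removed
    self_score = sum(pair_score(r, r) for r in residues)
    return 100 * actual_score / self_score


def peak(position_scores, window=20):
    """Maximum total score over any run of `window` consecutive positions."""
    if len(position_scores) <= window:
        # Assumption: a short match is scored as a single window.
        return sum(position_scores)
    best = current = sum(position_scores[:window])
    for i in range(window, len(position_scores)):
        # Slide the window one position to the right.
        current += position_scores[i] - position_scores[i - window]
        best = max(best, current)
    return best
```

With a toy scoring rule of +2 for identical residues, a segment "AC-GT" that
scored 8 against the query gives a Sig of 100, since its self-score is also 8.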



TO OBTAIN HELP:
---------------
    You can obtain this message at any moment by sending a message with one of:
{ dflash, dFlash, dFLASH, DFLASH } in the "Subject" line and a body containing
one of { help, HELP, send help, SEND HELP }.


TO OBTAIN ON-LINE REPRINTS OF PAPERS:
-------------------------------------
    You can obtain reprints (in PostScript) of relevant papers by sending a
message with one of: { dflash, dFlash, dFLASH, DFLASH } in the "Subject" line
and a body containing 

one of {flashpaper, FLASHPAPER, send flashpaper, SEND FLASHPAPER }        
					---> returns to the originator of the 
					request a copy of the FLASH paper

one of {dflashpaper, DFLASHPAPER, send dflashpaper, SEND DFLASHPAPER }        
					---> returns to the originator of the 
					request a copy of a paper that contains
					a description of dFLASH (long)

one of {concertpaper, CONCERTPAPER, send concertpaper, SEND CONCERTPAPER } 
                                        ---> returns to the originator of the 
					request a copy of a high-level paper
					describing the CONCERT/C language

one of {bayespaper, BAYESPAPER, send bayespaper, SEND BAYESPAPER } 
					---> returns to the originator of the 
					request a copy of a paper describing 
					a computer-vision application based 
					on indexing principles similar to 
					those of dFLASH (long)

Notice there can only be *one* such request per message!



OTHER NOTES:
------------

(1) for the time being we do not incorporate incremental updates of PIR.
(2) the reply from the server now contains the label on its Subject line; we
    thought this might be useful to some users.
(3) format checking and error reporting have been improved considerably.
(4) at the moment we are putting together the version of the server that will
    allow sequence searches in GenBank.  The current projection is that the
    GenBank search server will be available before the middle of January.
(5) dFLASH searches are currently available through GRAIL of the Oak Ridge
    National Laboratory.

Thank you for your interest in the dFLASH server. 

					Sincerely,

					The dFLASH Group


###############################################################################

COMMENTS??
----------
We will appreciate receiving your feedback, suggestions, comments, or bug
reports; all of these can be sent to "dflash at watson.ibm.com".  Please make
sure your "Subject" line contains the word "comments".

###############################################################################

REFERENCES
----------

If you make use of the dFLASH server, please reference 

     A. Califano and I. Rigoutsos, "FLASH: A Fast Look-up Algorithm for String
     Homology."  In Proceedings of the First International Conference on
     Intelligent Systems for Molecular Biology, July 1993, Bethesda, MD.

If you wish to find out more about the dFLASH server, you can contact Andrea
Califano (acal at watson.ibm.com) or Isidore Rigoutsos (rigoutso at watson.ibm.com).

###############################################################################


For more information on the Concert/C language, please refer to

     J. Auerbach, D. Bacon, A. Goldberg, G. Goldszmidt, A. Gopal, M. Kennedy,
     A. Lowry, J. Russell, W. Silverman, R. Strom, D. Yellin, and S. Yemini,
     "High-level language support  for programming reliable distributed
     systems."  In Proceedings of the International Conference on Computer
     Languages, April 1992, Oakland, California.

or contact Josh Auerbach (jsa at watson.ibm.com).

###############################################################################


---------------------------------- Cut Here  ----------------------------------




