The dFLASH group announces the availability of a new release of the dFLASH
email server. The new release features:
o greatly improved sensitivity
o improved mail interface
o improved syntax-error checking
o ability to recover text data pertaining to the retrieved sequences (see the
option VERBOSE below)
o new, more comprehensive protein database (we now use PIR Rel. 38)
o a new computational platform
o much improved alignment results
o new request handling approach: all of the submitted requests will
eventually be processed, independent of whether the server is running upon
reception of the request; users will not receive the "server UNavailable"
messages anymore, and there will be no need for re-submission.
Also, *no* registration is required anymore in order to use the dFLASH server.
All submitted requests will be honored, provided that the sender's email address
conforms to the accepted formats (see below). The server's help file, with
details on the use of the system, is included at the end of this message.
Finally, we look forward to receiving your feedback on the current version of
the server. Suggestions for modifications, comments, and criticism are greatly
appreciated and should be sent to dflash at watson.ibm.com (Subject: Comments).
Sincerely,
The dFLASH Group
---------------------------------- Cut Here ----------------------------------
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! N O T A B E N E !!
!! The dFLASH server is still under development. If some of the answers do !!
!! not make sense it is very likely that this is due to a bug in our code. !!
!! !!
!! Reporting of such bugs will help us to incorporate all the needed fixes. !!
!! !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! !!
!! The database that we use is PIR Release 38 with *no* incremental updates. !!
!! For more information, contact: !!
!! !!
!! Protein Information Resource (PIR) !!
!! National Biomedical Research Foundation !!
!! 3900 Reservoir Road, N.W., !!
!! Washington, DC 20007, USA !!
!! !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Dear User, welcome to the dFLASH server!
The dFLASH server is a "homologous sequence retrieval" program for protein
sequences (see also NOTES below). dFLASH is a distributed system which runs
on a 16-node IBM SP/1. Although the SP/1 has a fast interconnection network
for inter-node communication, dFLASH currently uses regular TCP/IP for message
delivery. Furthermore, evidence integration and alignment are performed on a
single node, instead of in parallel on all 16 nodes. As is evidenced by the
difference between the total CPU usage and the elapsed wall clock time, a large
portion of the total time is consumed by network communication and serial
processing. We will soon exploit the SP/1's fast interconnection network and
also parallelize the evidence integration/alignment code, resulting in an
expected 16-fold speedup. The system has been implemented using IBM's
Concert/C language for distributed programming. The server is now available 24
hours a day, 7 days a week. Meanwhile, incremental changes and improvements
made to the server will be reflected in the text of this help file: it is
recommended that users periodically issue a `send help' request for up-to-date
information on the server.
Effective today, November 16, 1993, *no* registration is required in order
to use the dFLASH server.
For the moment, we can process requests originating from email addresses of the
form
"user@[machine.]institution.type"
or
"user%machine@[machine.]institution.type"
We plan to further expand the accepted formats, depending on demand.
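The two address forms above can be read as a pattern. The following Python sketch shows one such reading; the server's actual check is not described in this help file, so the character classes and the names ADDRESS_RE and is_accepted_address are assumptions made for illustration only.

```python
import re

# One possible reading of the two documented forms:
#   user@[machine.]institution.type
#   user%machine@[machine.]institution.type
# The allowed characters are an assumption; the help file does not define them.
_LABEL = r"[A-Za-z0-9-]+"
_USER = r"[A-Za-z0-9._+-]+"
ADDRESS_RE = re.compile(
    rf"^{_USER}(?:%{_LABEL})?@(?:{_LABEL}\.)?{_LABEL}\.{_LABEL}$"
)


def is_accepted_address(address: str) -> bool:
    """Return True if the address fits one of the two documented forms."""
    return ADDRESS_RE.match(address.strip()) is not None
```

Under this strict reading, a UUCP-style path such as host!user or an address with more than three domain labels would not be accepted.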
You can use the dFLASH facilities by sending an email message to
"dflash at watson.ibm.com"
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
VERY IMPORTANT: the "Subject" line of the message should be one of: { dflash,
--------------- dFlash, dFLASH, DFLASH }. Messages whose subject line does
not conform to this rule will be left **unprocessed**. The reason
for this restriction is that we want to be able to automatically
distinguish between messages that are addressed to the server and
those that are meant for one of the group members.
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
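As a minimal Python sketch of the Subject rule above: only the four listed spellings are treated as server requests. The function name is hypothetical, and stripping surrounding whitespace is an assumption.

```python
# The four spellings listed in the rule above; any other subject is not
# treated as a server request.
ACCEPTED_SUBJECTS = {"dflash", "dFlash", "dFLASH", "DFLASH"}


def is_server_request(subject: str) -> bool:
    """Return True if the Subject line is one of the accepted spellings.

    Surrounding whitespace is ignored (an assumption); the match is
    otherwise exact, including case.
    """
    return subject.strip() in ACCEPTED_SUBJECTS
```

Note that a reply-style subject such as "Re: dFLASH" does not pass this check.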
REQUEST FORMAT:
---------------
The typical message-body of an email request looks like:
PAM 250                                  (mandatory | DIRECTIVE; PAM or BLOSUM)
VERBOSE 10 20                            (optional  | DIRECTIVE)
SEQUENCES 100                            (optional  | DIRECTIVE)
ALIGNMENTS 50                            (optional  | DIRECTIVE)
THRESHOLD 30                             (optional  | DIRECTIVE)
BEGIN                                    (mandatory | DIRECTIVE)
>A_ONE_LINE_TEST_SEQ_LABEL               (mandatory -- notice the '>' )
a_sequence_of_{amino_acids,spaces,tabs}
1                                        (mandatory terminator)
The PAM/BLOSUM, VERBOSE, SEQUENCES, ALIGNMENTS, and THRESHOLD directives can
appear in any order but they *must* precede the BEGIN directive. The BEGIN
line must precede the LABEL line, and the latter must precede the test sequence.
The test sequence must contain at least 30 and not more than 1,000 amino acids.
It *may* also contain CARRIAGE RETURNS, TABS and SPACES. The label and the test
sequence itself are not case sensitive.
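A sender may want to check a sequence against these limits before mailing it. The Python sketch below is one way to do so; clean_sequence is a hypothetical helper, and both the upper-casing and the "letters only" check are assumptions, since this help file does not list the accepted residue codes.

```python
MIN_RESIDUES = 30
MAX_RESIDUES = 1000


def clean_sequence(raw: str) -> str:
    """Strip whitespace from a test sequence and check its length.

    Carriage returns, tabs and spaces are removed, and the result is
    upper-cased because the sequence is not case sensitive.
    """
    residues = "".join(raw.split()).upper()
    # Assumption: any letter is accepted as a residue code.
    if residues and not residues.isalpha():
        raise ValueError("sequence may contain only letters and whitespace")
    if not MIN_RESIDUES <= len(residues) <= MAX_RESIDUES:
        raise ValueError(
            f"sequence has {len(residues)} residues; "
            f"expected {MIN_RESIDUES} to {MAX_RESIDUES}"
        )
    return residues
```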
The words appearing on the lines marked DIRECTIVE above can be in lower case or
upper case; in other words, you can have pam or PAM, threshold or THRESHOLD,
alignments or ALIGNMENTS, etc. However, something like ThReShOlD will not work.
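The case rule for directive words can be sketched in Python as follows. The set DIRECTIVES and the function name are illustrative and are not taken from the server code.

```python
# Directive words named in this help file.
DIRECTIVES = {
    "pam", "blosum", "verbose", "sequences",
    "alignments", "threshold", "begin",
}


def is_valid_directive_word(word: str) -> bool:
    """Accept a directive written entirely in lower or entirely in upper case."""
    uniform_case = word.islower() or word.isupper()
    return uniform_case and word.lower() in DIRECTIVES
```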
The VERBOSE line allows the sender to also retrieve the data about authors,
dates, entries, superfamilies etc. that are contained in the original PIR
database. This directive can take one or two arguments; for example:
verbose 15 25
means "send me the text data for the proteins occupying positions 15 through 25
in the final ranking." On the other hand,
verbose 15
means "send me the text data for the proteins occupying the first 15 positions
in the final ranking." If no verbose line appears, no text data will be sent.
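The two VERBOSE forms map to a range of ranking positions. The Python sketch below illustrates that mapping; parse_verbose is a hypothetical name, and how the server treats a first position larger than the last is not stated here, so the sketch does not check it.

```python
from typing import Optional, Tuple


def parse_verbose(line: Optional[str]) -> Optional[Tuple[int, int]]:
    """Map a VERBOSE line to the (first, last) ranking positions requested.

    Returns None when no VERBOSE line was given, meaning no text data.
    """
    if line is None:
        return None
    keyword, *args = line.split()
    if keyword not in ("verbose", "VERBOSE"):
        raise ValueError("not a VERBOSE line")
    numbers = [int(a) for a in args]
    if len(numbers) == 1:
        # "verbose 15": the first 15 positions of the final ranking.
        return (1, numbers[0])
    if len(numbers) == 2:
        # "verbose 15 25": positions 15 through 25.
        return (numbers[0], numbers[1])
    raise ValueError("VERBOSE takes one or two arguments")
```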
The SEQUENCES line allows one to restrict the reported sequences to the given
number. This directive controls the number of entries in the ``short list''
of recovered database sequences only. If no SEQUENCES line is given, the
server code will set it to an appropriate default value.
The ALIGNMENTS line allows one to restrict the reported alignments to the given
number. If no ALIGNMENTS line is given, the server code will set it to an
appropriate default value. The ALIGNMENTS value cannot exceed 1000. Values
larger than 1000 are reduced to 1000.
The THRESHOLD line allows one to restrict the reported sequences (and thus
alignments) to only those whose Score exceeds the given THRESHOLD value.
If no THRESHOLD line is given, the server code will set it to an appropriate
default value. The THRESHOLD value cannot be less than 30. Values smaller
than 30 are increased to 30. Notice: if the THRESHOLD value is too small, you
risk upsetting your mailer program, since the reply from the server is likely
to be a very big file.
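The limits on ALIGNMENTS and THRESHOLD amount to simple clamping, sketched below in Python. The function names are hypothetical, and the server's default values are not modeled because this help file does not state them.

```python
MAX_ALIGNMENTS = 1000
MIN_THRESHOLD = 30


def effective_alignments(requested: int) -> int:
    """Values larger than 1000 are reduced to 1000."""
    return min(requested, MAX_ALIGNMENTS)


def effective_threshold(requested: int) -> int:
    """Values smaller than 30 are increased to 30."""
    return max(requested, MIN_THRESHOLD)
```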
The LABEL line *must* now be preceded by the character '>'.
Finally, notice that you need to terminate the sequence with the terminator '1'.
Two example requests follow:
Example 1:
pam 250
sequences 50
alignments 30
threshold 100
begin
> HBA_HUMAN STANDARD; PRT; 141 AA. P01922; HEMOGLOBIN ALPHA
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFP
TTKTYFPHFDLSHGSAQVKGHG KKVADALTNA
V A H V D D M PNALSALSDLHAHKLRVDPVNFK
llshcllvtlaahlpaeftpavhasldkflasvstvltskyr
1
Note: all amino acids from "VLSP" through "ltskyr" will be used
in the search. Not more than the 50 top-scoring sequences will be
reported in the short list. Also, the alignments for the 30 top-scoring
sequences will be returned. No reported sequence will have a score
that is less than 100.
Example 2:
BLOSUM 62
BEGIN
> Your-Favorite-Label Goes Here
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFP
TTKTYFPHFDLSHGSAQVKGHG KKVADALTNA
V A H V D D M PNALSALSDLHAHKLRVDPVNFK
llshcllvtlaahlpaeftpavhasldkflasvstvltskyr
1
Note: all amino acids from "VLSP" through "ltskyr" will be used
in the search. The server code will set the various parameters to
appropriate default values.
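Requests like the two examples above can also be assembled programmatically. The Python sketch below builds such a message with the standard email.message module; build_request and its parameters are hypothetical, the addresses are supplied by the caller, and wrapping the sequence at 60 characters is only a convenience that relies on line breaks being allowed inside the test sequence.

```python
import textwrap
from email.message import EmailMessage

OPTIONAL_DIRECTIVES = ("verbose", "sequences", "alignments", "threshold")


def build_request(to_addr, from_addr, matrix, distance, label, sequence,
                  **options):
    """Assemble a request message in the documented order.

    Optional directives come before BEGIN; the label line starts with '>'
    and the sequence is closed by the terminator '1'.
    """
    unknown = set(options) - set(OPTIONAL_DIRECTIVES)
    if unknown:
        raise ValueError(f"unknown directives: {sorted(unknown)}")
    lines = [f"{matrix} {distance}"]
    for name in OPTIONAL_DIRECTIVES:
        if name in options:
            value = options[name]
            values = value if isinstance(value, (list, tuple)) else [value]
            lines.append(name + " " + " ".join(str(v) for v in values))
    lines.append("begin")
    lines.append(f">{label}")
    lines.extend(textwrap.wrap(sequence, width=60))
    lines.append("1")

    msg = EmailMessage()
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg["Subject"] = "dFLASH"  # one of the accepted Subject spellings
    msg.set_content("\n".join(lines) + "\n")
    return msg
```

A call such as build_request(server, me, "pam", 250, "MY_LABEL", seq, sequences=50, alignments=30, threshold=100) produces a body with the same directives as Example 1.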
SCORING MATRICES:
-----------------
You can use both PAM and BLOSUM scoring matrices. These can be requested via
one of { pam, PAM, blosum, BLOSUM }. The currently supported distances are
for BLOSUM: 30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 100
for PAM: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150,
160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280,
290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410,
420, 430, 440, 450, 460, 470, 480, 490, and 500.
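The supported distances listed above can be captured in a small lookup table. The Python sketch below is illustrative only; SUPPORTED_DISTANCES and is_supported_matrix are hypothetical names.

```python
# Distances copied from the list above: PAM runs from 10 to 500 in steps
# of 10; the BLOSUM values are enumerated explicitly.
SUPPORTED_DISTANCES = {
    "pam": set(range(10, 501, 10)),
    "blosum": {30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 100},
}


def is_supported_matrix(name: str, distance: int) -> bool:
    """Check a matrix request such as ("PAM", 250) or ("blosum", 62)."""
    # The keyword must be all lower case or all upper case.
    if not (name.islower() or name.isupper()):
        return False
    return distance in SUPPORTED_DISTANCES.get(name.lower(), ())
```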
NOTE ON ALIGNMENT:
------------------
The server's alignment code now implements dynamic programming. This is not to
be confused with the indexing method that is used to determine the candidates
to align.
The meaning of the variables in the listing that is returned by the dFLASH
server
.....
....
Score Matrix: PAM250
Max Reported Sequences: 1000
Max Reported Alignments: 10
Score Threshold At: 65
Id Label:                                  Score NRes  Ex% Tot%  Sig  Pk
----------------------------------------------------------------------------
1. HAHU hemoglobin alpha chain - human       655  141 100% 100%  100  89
2. HACZ hemoglobin alpha chain - chimpanzee  655  141 100% 100%  100  89
3. HACZP hemoglobin alpha chain - pygmy chi  655  141 100% 100%  100  89
4. HAGO hemoglobin alpha chain - lowland go  654  141  99% 100%   99  89
5. HAMQP hemoglobin alpha chain - hanuman l  653  141  97% 100%   99  89
6. B27792 hemoglobin alpha-1 chain - orangu  649  141  97% 100%   99  89
7. A25126 hemoglobin alpha-1 chain - Sumatr  649  141  97% 100%   99  89
...
.....
..
is the following:
Score: sequence similarity score of the recovered sequence based on the
       selected mutation matrix
NRes:  the number of residues (amino acids) in the recovered match
Ex%:   percentage of *exact* matching residues
Tot%:  percentage of *total* (=exact+conservative) matching residues
Sig:   100 times the ratio between the actual computed score and the score
       obtained by matching the retrieved sub-segment with itself; the
       denominator is the maximum obtainable score for the sub-segment in
       question (all gaps removed).
Pk:    the peak score, i.e. the maximum score value over *any* 20-residue
       window of the recovered match
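The Sig and Peak (Pk) definitions above can be illustrated with a short Python sketch. It assumes that the alignment score, the self-match score of the sub-segment, and a list of per-position scores are already available; the function names are hypothetical, and the way the server rounds values or treats matches shorter than 20 residues is not stated in this help file.

```python
def significance(score: float, self_score: float) -> float:
    """Sig: 100 times the ratio of the computed score to the self-match score."""
    return 100.0 * score / self_score


def peak(position_scores, window: int = 20):
    """Peak: the largest total score over any run of `window` positions.

    Assumption: for a match shorter than the window, the total over the
    whole match is returned.
    """
    if len(position_scores) <= window:
        return sum(position_scores)
    current = sum(position_scores[:window])
    best = current
    for i in range(window, len(position_scores)):
        # Slide the window one position to the right.
        current += position_scores[i] - position_scores[i - window]
        best = max(best, current)
    return best
```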
TO OBTAIN HELP:
---------------
You can obtain this message at any moment by sending a message with one of:
{ dflash, dFlash, dFLASH, DFLASH } in the "Subject" line and a body containing
one of { help, HELP, send help, SEND HELP }.
TO OBTAIN ON-LINE REPRINTS OF PAPERS
------------------------------------
You can obtain reprints (in PostScript) of relevant papers by sending a
message with one of: { dflash, dFlash, dFLASH, DFLASH } in the "Subject" line
and a body containing
one of { flashpaper, FLASHPAPER, send flashpaper, SEND FLASHPAPER }
    ---> returns to the originator of the request a copy of the FLASH paper
one of { dflashpaper, DFLASHPAPER, send dflashpaper, SEND DFLASHPAPER }
    ---> returns to the originator of the request a copy of a paper that
         contains a description of dFLASH (long)
one of { concertpaper, CONCERTPAPER, send concertpaper, SEND CONCERTPAPER }
    ---> returns to the originator of the request a copy of a high-level
         paper describing the CONCERT/C language
one of { bayespaper, BAYESPAPER, send bayespaper, SEND BAYESPAPER }
    ---> returns to the originator of the request a copy of a paper
         describing a computer-vision application based on indexing
         principles similar to those of dFLASH (long)
Notice there can only be *one* such request per message!
OTHER NOTES:
-------------
(1) for the time being we do not incorporate incremental updates of PIR.
(2) the reply from the server now contains the label on its Subject line; we
thought this might be useful to some users.
(3) format checking and error reporting have been improved considerably.
(4) at the moment we are putting together the version of the server that will
allow sequence searches in GenBank. The current projection is that the
GenBank search server will be available before the middle of January.
(5) dFLASH searches are currently available through GRAIL of the Oak Ridge
National Laboratory.
Thank you for your interest in the dFLASH server.
Sincerely,
The dFLASH Group
###############################################################################
COMMENTS??
----------
We will appreciate receiving your feedback, suggestions, comments, or bug
reports; all of these can be sent to "dflash at watson.ibm.com". Please make
sure your "Subject" line contains the word "comments".
###############################################################################
REFERENCES
----------
If you make use of the dFLASH server, please reference
A. Califano and I. Rigoutsos, "FLASH: A Fast Look-up Algorithm for String
Homology." In Proceedings of the First International Conference on
Intelligent Systems for Molecular Biology, July 1993, Bethesda, MD.
If you wish to find out more about the dFLASH server, you can contact Andrea
Califano (acal at watson.ibm.com) or Isidore Rigoutsos (rigoutso at watson.ibm.com).
###############################################################################
For more information on the Concert/C language, please refer to
J. Auerbach, D. Bacon, A. Goldberg, G. Goldszmidt, A. Gopal, M. Kennedy,
A. Lowry, J. Russell, W. Silverman, R. Strom, D. Yellin, and S. Yemini,
"High-level language support for programming reliable distributed
systems." In Proceedings of the International Conference on Computer
Languages, April 1992, Oakland, California.
or contact Josh Auerbach (jsa at watson.ibm.com).
###############################################################################
---------------------------------- Cut Here ----------------------------------