The dFLASH Group announces the release of a new electronic mail server that
allows GENBANK and PIR similarity searches with the FLASH algorithm. The
server is accessible through the internet and is now operating 24 hours a
day, 7 days a week. Appended, you will find the new server's help file.
Sincerely,
The dFLASH Group
------------------------------> CUT HERE <-----------------------------------
!! ==================== MESSAGE OF THE DAY ========================== !!
!! !!
!! June 14, 1994: We are now supporting both GENBANK & PIR searches !!
!! Releases: PIR 40 (no incremental updates) !!
!! & GENBANK 82 (no incremental updates) !!
!! June 12, 1994: The VERBOSE directive now works correctly. !!
!! November 16, 1993: Beginning today, *NO* registration is required !!
!! in order to use the dflash server. !!
!! !!
!! ==================== MESSAGE OF THE DAY ========================== !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! !!
!! N O T A B E N E !!
!! The dFLASH server is still under development. If some of the answers do !!
!! not make sense it is very likely that this is due to a bug in our code. !!
!! Please, help us improve this service by reporting such bugs: you can !!
!! email bug reports and comments to dflash at watson.ibm.com with subject line !!
!! "bug" or "comments". !!
!! !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Dear User, welcome to the dFLASH server!
The dFLASH server is a "homologous sequence retrieval" program for PROTEIN
and DNA sequences. We currently support the following databases:
PIR -- Release 40 with *no* incremental updates.
For more information, contact: Protein Information Resource (PIR),
National Biomedical Research Foundation 3900 Reservoir Road, N.W.,
Washington, DC 20007, USA
GenBank -- Release 82 with *no* incremental updates
For more information, contact: National Center for Biotechnology
Information National Library of Medicine, 38A -- 8N805, 8600 Rockville
Pike Bethesda, MD 20894, USA
dFLASH is a parallel system running on a prototype 16-node IBM SP/1.
Inter-node communication, evidence integration, and alignment are performed in
parallel and use a prototype interconnection network. The currently observed
performance is estimated to be between 2 and 5 times slower than that of a
production line SP/1. The system has been implemented using IBM's Concert/C
language for distributed programming. The server is now available 24 hours a
day, 7 days a week. Backups are performed daily between 01:00 and 01:30 Eastern
Standard Time, so performance is likely to be affected during this period. At
the moment we automatically alternate service between the GenBank and PIR
servers, and thus some lag may be observed; this will be alleviated in
future releases of the server code.
Incremental changes and improvements made to the server will be reflected
in the "Message of the day" at the beginning of this help file: we recommend
that users periodically issue a `send help' request for up to date information
on the server.
For the moment, we can process requests originating from email addresses of
the form
user@[machine.][subdomain.]institution.type
or
user%machine@[machine.][subdomain.]institution.type
or
"string::user"@[machine.][subdomain.]institution.type
We plan to further expand the accepted formats, depending on demand.
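The three address forms above can be approximated with regular expressions.
The following is only a sketch: the character sets allowed in each part are
our assumption, and `looks_accepted` is a hypothetical helper, not part of the
server code.

```python
import re

# One to four dot-separated labels: [machine.][subdomain.]institution.type
# (the characters permitted in a label are an assumption).
_DOMAIN = r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+){1,3}"
# Characters permitted in user/machine/string names are also an assumption.
_NAME = r"[A-Za-z0-9._+-]+"

ACCEPTED_FORMS = [
    re.compile(rf"{_NAME}@{_DOMAIN}"),              # user@...
    re.compile(rf"{_NAME}%{_NAME}@{_DOMAIN}"),      # user%machine@...
    re.compile(rf'"{_NAME}::{_NAME}"@{_DOMAIN}'),   # "string::user"@...
]


def looks_accepted(address: str) -> bool:
    """Return True if the address matches one of the three documented forms."""
    return any(pattern.fullmatch(address) for pattern in ACCEPTED_FORMS)
```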
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
HOW TO USE THE SERVER:
----------------------
You can use the dFLASH facilities by sending an email message with the
appropriate syntax to the address "dflash at watson.ibm.com" (without the
quotes).
SUBJECT LINE:
-------------
It is important that the "Subject" line of your message contain one of:
{ dflash, dFlash, dFLASH, DFLASH }. Messages whose subject line does NOT
conform to this rule **WILL BE LEFT UNPROCESSED**. The reason for this
restriction is that we want to be able to automatically distinguish between
messages that are addressed to the server and those that are meant for one of
the group members.
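A client-side check of the subject rule might look like the sketch below. It
assumes that "contain" means the word appears as a whitespace-delimited token;
the server's actual matching may differ, and `subject_is_for_server` is a
hypothetical name.

```python
# The four spellings listed in the help text; no other capitalization is accepted.
ACCEPTED_SUBJECT_WORDS = {"dflash", "dFlash", "dFLASH", "DFLASH"}


def subject_is_for_server(subject: str) -> bool:
    """Return True if any whitespace-delimited token is an accepted spelling."""
    return any(token in ACCEPTED_SUBJECT_WORDS for token in subject.split())
```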
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
MESSAGE FORMAT: The typical message-body of an email request looks as follows
---------------
PAM 250 (optional | DIRECTIVE)
VERBOSE 10 20 (optional | DIRECTIVE)
SEQUENCES 100 (optional | DIRECTIVE)
ALIGNMENTS 50 (optional | DIRECTIVE)
THRESHOLD 30 (optional | DIRECTIVE)
TARGET PROTEIN (optional | DIRECTIVE)
BEGIN (mandatory | DIRECTIVE)
>A_ONE_LINE_TEST_SEQ_LABEL (mandatory -- notice the '>' )
a_sequence_of_{amino_acids,spaces,tabs}
1 (mandatory terminator)
The PAM/BLOSUM, VERBOSE, SEQUENCES, ALIGNMENTS, THRESHOLD, and TARGET
directives can appear in any order but they *must* precede the BEGIN directive.
The BEGIN line must be followed by the LABEL line which in turn should be
followed by the test sequence.
The test sequence should contain at least 30 and not more than 3,000 amino
acid or nucleotide characters. It may, however, contain ANY NUMBER of CARRIAGE
RETURN, TAB, and SPACE characters; these are not counted when computing the
length of the test sequence. Neither the label nor the test sequence is case
sensitive.
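The length rule can be checked before a query is mailed. The sketch below
counts every character that is not a space, tab, carriage return, or line
feed; treating the line feed like a carriage return is our assumption, and the
function names are hypothetical.

```python
# Characters that do not count toward the sequence length.
IGNORED = set(" \t\r\n")

MIN_RESIDUES = 30
MAX_RESIDUES = 3000


def residue_count(sequence_text: str) -> int:
    """Count the characters of the test sequence, skipping layout whitespace."""
    return sum(1 for ch in sequence_text if ch not in IGNORED)


def length_ok(sequence_text: str) -> bool:
    """Return True if the sequence has between 30 and 3,000 counted characters."""
    return MIN_RESIDUES <= residue_count(sequence_text) <= MAX_RESIDUES
```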
The words appearing on the lines marked DIRECTIVE above can be in lower
case or upper case; in other words, you can have pam or PAM, threshold or
THRESHOLD, alignments or ALIGNMENTS, etc. However, something like ThReShOlD
will not work.
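The case rule for directive words can be sketched as follows. The set of
keywords is taken from the message format above; `directive_keyword_ok` is a
hypothetical helper.

```python
DIRECTIVES = {"pam", "blosum", "verbose", "sequences",
              "alignments", "threshold", "target", "begin"}


def directive_keyword_ok(word: str) -> bool:
    """Accept a known directive written entirely in lower or entirely in upper case."""
    return (word.islower() or word.isupper()) and word.lower() in DIRECTIVES
```

For instance, `directive_keyword_ok("threshold")` and
`directive_keyword_ok("THRESHOLD")` are both true, while
`directive_keyword_ok("ThReShOlD")` is false.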
The directive pertaining to the scoring matrix allows the user to specify
the matrix to be used for computing the alignment scores. You can use either
the word PAM followed by a space and the desired distance, or the word BLOSUM
followed by space and the desired distance. Examples: PAM 250, BLOSUM 62 etc.
If no matrix directive is included in the message, PAM 250 is used as the
default. Depending on the value of the TARGET directive (see below), the
matrix directive, if present, may be ignored.
The VERBOSE line allows the sender to also retrieve the data about authors,
dates, entries, superfamilies etc. that are contained in the original PIR and
GenBank databases. This directive accepts one OR two arguments; for example:
verbose 15 25
means "send me the text data for the sequences occupying positions 15 through 25
in the final ranking." On the other hand,
verbose 15
means "send me the text data for the sequences occupying the first 15 positions
in the final ranking." If no verbose line appears, no citation data is sent.
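The one-argument and two-argument forms of VERBOSE map to a range of ranking
positions as sketched below. The function name is hypothetical, and positions
are taken to be 1-based and inclusive, as the examples above suggest.

```python
def verbose_positions(line: str) -> tuple[int, int]:
    """Return the (first, last) ranking positions requested by a VERBOSE line."""
    keyword, *args = line.split()
    if keyword not in ("verbose", "VERBOSE"):
        raise ValueError("not a VERBOSE line")
    numbers = [int(arg) for arg in args]
    if len(numbers) == 1:
        # "verbose 15" means the first 15 positions.
        return (1, numbers[0])
    if len(numbers) == 2:
        # "verbose 15 25" means positions 15 through 25.
        return (numbers[0], numbers[1])
    raise ValueError("VERBOSE takes one or two arguments")
```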
The SEQUENCES line allows one to restrict the reported sequences to the
given number. This directive controls the number of entries in the ``short
list'' of recovered database sequences only. If no SEQUENCES line is given,
the server code will set it to an appropriate default value.
The ALIGNMENTS line allows one to restrict the reported alignments to the
given number. If no ALIGNMENTS line is given, the server code will set it to
an appropriate default value. The ALIGNMENTS value cannot exceed 1000. Values
larger than 1000 are reduced to 1000.
The THRESHOLD line allows one to restrict the number of reported sequences
(and thus alignments) to only those whose Score exceeds the given THRESHOLD
value. If no THRESHOLD line is given the server code will set it to an
appropriate default value. The default values are 50 for DNA sequences, and 80
for PROTEIN sequences. There is also a *hard* threshold value of 20 for DNA,
and 30 for PROTEIN sequences; if the user-requested values are smaller than
these hard-thresholds, the requested threshold will be increased accordingly.
NOTA BENE:
----------
(1) if the THRESHOLD value is too small, you run the risk of upsetting your
    mailer program, since chances are that you will receive a very big file
    as a reply from the server.
(2) if the THRESHOLD is too high, the list of recovered entries will be
    empty or very short; you should decrease the threshold's value and
    resubmit your query.
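The interplay of the default and hard threshold values described above can be
summarized in a few lines. This is a sketch of the stated rule only;
`effective_threshold` is a hypothetical name and not the server's code.

```python
DEFAULT_THRESHOLD = {"PROTEIN": 80, "DNA": 50}
HARD_THRESHOLD = {"PROTEIN": 30, "DNA": 20}


def effective_threshold(target: str = "PROTEIN", requested=None) -> int:
    """Return the threshold that would be applied for a given TARGET."""
    if requested is None:
        # No THRESHOLD line: fall back to the default for this target.
        return DEFAULT_THRESHOLD[target]
    # Requests below the hard threshold are raised to it.
    return max(requested, HARD_THRESHOLD[target])
```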
The TARGET line allows the user to specify the type of the target database,
as being one of { PROTEIN , DNA }. This way the user controls the database in
which the search will be carried out. If TARGET is set to PROTEIN, the search
will take place in the PIR database. If TARGET is set to DNA, the search will
take place in the GenBank database. We will soon allow users to 'mix and
match', i.e., users will be able to request that amino acid sequences be
searched against GenBank, nucleotide sequences against PIR, etc., by making use
of the appropriate directives. If no TARGET line is given, the server will
assume the default value PROTEIN and thus will search against the PIR database.
The LABEL line allows the user to enter mnemonic information pertaining to
the test sequence, the time of day, etc. The information on this line will
be reproduced in the Subject line of the reply message. Notice that the
LABEL line *must* begin with the character '>'.
All submitted messages must be terminated by the number '1'. This
number can follow the last character of the test sequence or be in a line by
itself.
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
EXAMPLES: Two example inputs follow
---------
Example 1:
pam 250
sequences 50
alignments 30
threshold 100
target protein
begin
> HBA_HUMAN STANDARD; PRT; 141 AA. P01922; HEMOGLOBIN ALPHA
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFP
TTKTYFPHFDLSHGSAQVKGHG KKVADALTNA
V A H V D D M PNALSALSDLHAHKLRVDPVNFK
llshcllvtlaahlpaeftpavhasldkflasvstvltskyr
1
Note: all amino acids from "VLSP" through "ltskyr" will be used
in the search. Not more than the 50 top scoring sequences will be
reported in the short list. Also, the alignments for the top 30
scoring sequences will be returned. No reported sequence will have
a score that is less than 100. The test sequence is declared to be a
sequence of amino acids and should be searched against the PIR
database.
Example 2:
BLOSUM 62
BEGIN
> Sequence sent to dflash on Fri May 20 13:40:17 EDT 1994
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFP
TTKTYFPHFDLSHGSAQVKGHG KKVADALTNA
V A H V D D M PNALSALSDLHAHKLRVDPVNFK
llshcllvtlaahlpaeftpavhasldkflasvstvltskyr
1
Note: all amino acids from "VLSP" through "ltskyr" will be used
in the search. The server code will set the various parameters to
appropriate default values. The server will treat the test sequence
as a sequence of amino acids (default) and will search against the
PIR database (default) with a score threshold set at 80 (default).
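Requests like the two examples above can also be assembled by a program. The
sketch below uses Python's standard `email.message.EmailMessage`; the function
name `build_request` and its parameters are hypothetical, the sender and
server addresses are supplied by the caller, and actually sending the message
is left to whatever mail transport you use.

```python
from email.message import EmailMessage


def build_request(label, sequence, directives=None, *, sender, server):
    """Assemble a query message in the format described in this help file."""
    if "\n" in label or "\r" in label:
        raise ValueError("the label must fit on one line")
    # Optional directives come first, in any order, one per line.
    lines = [f"{name} {value}" for name, value in (directives or {}).items()]
    lines.append("BEGIN")          # mandatory directive
    lines.append(">" + label)      # mandatory label line, starting with '>'
    lines.append(sequence)         # the test sequence
    lines.append("1")              # mandatory terminator
    message = EmailMessage()
    message["From"] = sender
    message["To"] = server
    message["Subject"] = "dFLASH"  # one of the accepted subject spellings
    message.set_content("\n".join(lines) + "\n")
    return message
```

Calling `build_request("my test", seq, {"PAM": 250, "TARGET": "PROTEIN"},
sender=..., server=...)` yields a message whose body lists the two directives,
then BEGIN, the label line, the sequence, and the terminating 1.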
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
SCORING MATRICES:
-----------------
You can use both PAM and BLOSUM scoring matrices for protein searches. These
can be requested via the optional { pam, PAM, blosum, BLOSUM } directive. The
currently supported distances are
for BLOSUM: 30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 100
for PAM: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150,
160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280,
290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410,
420, 430, 440, 450, 460, 470, 480, 490, and 500.
For DNA searches, the PAM/BLOSUM declarations are ignored.
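The supported distances listed above can be checked before a query is sent.
The table below transcribes that list (the PAM distances are every multiple of
10 from 10 through 500); `matrix_supported` is a hypothetical helper.

```python
SUPPORTED_DISTANCES = {
    "BLOSUM": {30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90, 100},
    "PAM": set(range(10, 501, 10)),
}


def matrix_supported(family: str, distance: int) -> bool:
    """Return True if the family/distance pair appears in the supported list."""
    return distance in SUPPORTED_DISTANCES.get(family.upper(), set())
```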
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
NOTE ON ALIGNMENT:
------------------
The server's alignment code implements the Smith-Waterman algorithm (dynamic
programming) to align each of the retrieved sequences with the test input. This
is *NOT* to be confused with the indexing method that we use to determine the
candidates to be aligned.
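For readers unfamiliar with Smith-Waterman, the sketch below computes the
best local alignment score of two strings by dynamic programming. It is an
illustration only: it uses a simple match/mismatch score and a linear gap
penalty (all three values are our assumptions) rather than the PAM or BLOSUM
matrices and the gap handling of the server's own code.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score of a and b (case-insensitive)."""
    a, b = a.upper(), b.upper()
    previous = [0] * (len(b) + 1)   # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                        # start a new local alignment here
                previous[j - 1] + pair,   # align a[i-1] with b[j-1]
                previous[j] + gap,        # gap in b
                current[j - 1] + gap,     # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best
```

Because every cell is floored at zero, a poorly matching prefix never
penalizes a well-matching region later in the sequences; this is what makes
the alignment local rather than global.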
The meaning of the variables in the listing that is returned by the dFLASH
server
.....
....
Score Matrix: PAM250
Max Reported Sequences: 1000
Max Reported Alignments: 10
Score Threshold At: 65
Id Label: Score NRes Ex% Tot% Sig Pk
----------------------------------------------------------------------------
1. HAHU hemoglobin alpha chain - human 655 141 100% 100% 100 89
2. HACZ hemoglobin alpha chain - chimpanzee 655 141 100% 100% 100 89
3. HACZP hemoglobin alpha chain - pygmy chi 655 141 100% 100% 100 89
4. HAGO hemoglobin alpha chain - lowland go 654 141 99% 100% 99 89
5. HAMQP hemoglobin alpha chain - hanuman l 653 141 97% 100% 99 89
6. B27792 hemoglobin alpha-1 chain - orangu 649 141 97% 100% 99 89
7. A25126 hemoglobin alpha-1 chain - Sumatr 649 141 97% 100% 99 89
...
.....
..
is the following:
NRes: the number of residues (amino acids) in the recovered match
Score: sequence similarity score of the recovered sequence based on the
selected mutation matrix
Ex%: percentage of *exact* matching residues
Tot%: percentage of *total* (=exact+conservative) matching residues
Sig: 100 times the ratio between the actual computed score and the score
obtained by matching the retrieved sub-segment with itself; the
denominator is the maximum obtainable score for the sub-segment in
question (all gaps removed).
Pk: (peak) the maximum score value over *any* 20-residue window of the
recovered match
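A row of the listing can be split into its fields by working from the right,
since the label may itself contain spaces. The sketch below assumes the row
layout shown in the sample listing above; the function names are
hypothetical, and `sig_value` simply restates the Sig definition without any
rounding, since the rounding used by the server is not stated here.

```python
def parse_row(line: str) -> dict:
    """Split one listing row into rank, id, label, and the six numeric columns."""
    head, score, nres, ex, tot, sig, pk = line.rsplit(None, 6)
    rank, entry_id, description = head.split(None, 2)
    return {
        "rank": int(rank.rstrip(".")),
        "id": entry_id,
        "description": description,
        "score": int(score),
        "nres": int(nres),
        "exact_pct": int(ex.rstrip("%")),
        "total_pct": int(tot.rstrip("%")),
        "sig": int(sig),
        "peak": int(pk),
    }


def sig_value(score: float, self_score: float) -> float:
    """100 times the ratio of the computed score to the self-match score."""
    return 100 * score / self_score
```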
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
TO OBTAIN HELP:
---------------
You can obtain this message at any moment by sending a message with one of:
{ dflash, dFlash, dFLASH, DFLASH } in the "Subject" line and a body containing
one of { help, HELP, send help, SEND HELP }.
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
TO OBTAIN ON-LINE REPRINTS OF PAPERS
------------------------------------
You can obtain reprints (in PostScript) of relevant papers by sending a
message with one of: { dflash, dFlash, dFLASH, DFLASH } in the "Subject" line
and a body containing
one of {flashpaper, FLASHPAPER, send flashpaper, SEND FLASHPAPER }
---> returns to the originator of the
request a copy of the FLASH paper
one of {dflashpaper, DFLASHPAPER, send dflashpaper, SEND DFLASHPAPER }
---> returns to the originator of the
request a copy of a paper that contains
a description of dFLASH (long)
one of {concertpaper, CONCERTPAPER, send concertpaper, SEND CONCERTPAPER }
---> returns to the originator of the
request a copy of a high-level paper
describing the CONCERT/C language
one of {bayespaper, BAYESPAPER, send bayespaper, SEND BAYESPAPER }
---> returns to the originator of the
request a copy of a paper describing
a computer-vision application based
on indexing principles similar to
those of dFLASH (long)
Notice that there can be only *one* such request per message! Also, make sure
you do not issue a new paper request until after the previous request has
returned to you all of the PostScript files and you have removed the latter
from your mailbox: the returned messages are rather big (between 1 and 4
Megabytes) and are guaranteed to overflow the disk set aside for mail messages
on most systems.
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
OTHER NOTES:
-------------
(1) for the time being we do not incorporate incremental updates of PIR
(2) for the time being we do not incorporate incremental updates of GenBank
(3) dFLASH searches are currently available through GRAIL of the Oak Ridge
National Laboratory.
Thank you for your interest in the dFLASH server.
Sincerely,
The dFLASH Group
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
COMMENTS??
----------
We will appreciate receiving your feedback, suggestions, comments, or bug
reports; all of these can be sent to "dflash at watson.ibm.com". Please make
sure your "Subject" line contains the word "comments" or "bug".
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
REFERENCES
----------
If you make use of the dFLASH server, please reference
A. Califano and I. Rigoutsos, "FLASH: A Fast Look-up Algorithm for String
Homology." In Proceedings of the First International Conference on
Intelligent Systems for Molecular Biology, July 1993, Bethesda, MD.
I. Rigoutsos and A. Califano, "Searching In Parallel for Similar Protein
Strings." In IEEE Computational Science and Engineering, June 1994.
If you wish to find out more, you can contact Isidore Rigoutsos and Andrea
Califano at {rigoutso,acal}@watson.ibm.com
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
For more information on the Concert/C language, please refer to
J. Auerbach, D. Bacon, A. Goldberg, G. Goldszmidt, A. Gopal, M. Kennedy,
A. Lowry, J. Russell, W. Silverman, R. Strom, D. Yellin, and S. Yemini,
"High-level language support for programming reliable distributed
systems." In Proceedings of the International Conference on Computer
Languages, April 1992, Oakland, California.
or contact Jim Russell (jrussell at watson.ibm.com)
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
------------------------------> CUT HERE <-----------------------------------