Kabat database Seqhunt2 server
George Johnson
george at immuno.bme.nwu.edu
Sat Mar 18 19:01:31 EST 1995
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+                                                                       +
+              The Kabat Database of Sequences of Proteins              +
+                       of Immunological Interest                       +
+                                                                       +
+             For help, questions or comments please write:             +
+                                                                       +
+            George Johnson     george at immuno.bme.nwu.edu            +
+            Tai Te Wu          tt at immuno.bme.nwu.edu                +
+                                                                       +
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
NEW ADDITIONS
-------------
RAWDATA
This feature, put before the Begin line, will return raw, unaligned
data for straight matching of the restriction you send in. This
feature won't work with #NM and #AM pattern matching (sort of would
defeat the purpose). This is for people who already know which sequence
they want, and want to feed the matches into some other program, like
an alignment program or whatever for presentation. THIS FEATURE IS
NOT FOR INDIVIDUALS WANTING TO DOWNLOAD THE ENTIRE DAMN DATABASE THROUGH
THE MAIL. For those users, please direct yourselves toward the archive
site or talk to us.
GENERAL STUFF
-------------
A server is available at seqhunt2 at immuno.bme.nwu.edu. Sending mail
to this address with the single word "help" in the message body will
return this file.
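If you would rather script your requests than type them into a mailer,
a request is just a plain-text message. Here is a minimal sketch in
Python using the standard email package. It only builds the message; it
does not send it. The sender address and the subject line are made up
for illustration, and it is an assumption that the server ignores the
subject.

```python
from email.message import EmailMessage

# Server address as given above, with " at " written as "@".
SERVER_ADDRESS = "seqhunt2@immuno.bme.nwu.edu"


def make_request(body, sender):
    """Wrap a request body in a plain-text mail message for the server."""
    msg = EmailMessage()
    msg["To"] = SERVER_ADDRESS
    msg["From"] = sender                  # hypothetical: your own address
    msg["Subject"] = "seqhunt2 request"   # assumption: subject is ignored
    msg.set_content(body)
    return msg


# The single word "help" in the body asks for this file.
help_request = make_request("help\n", "you@example.org")
print(help_request.get_content())
```

Handing the message to your local mail system (for example with
smtplib) is left out on purpose, since that part depends on your site.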
This server is an improved version of the original seqhunt server.
This server allows the e-mail user to interface with the complete
database. All sequence classes and all annotations are searchable and
returnable. The query format is a simplification of the rather
restricted format of the original seqhunt server. Briefly, the server
allows you to make and/or/not constructed restrictions and allows
nucleotide and amino acid pattern matching with differences allowed.
The dataset searched is a raw archive of the database, and thus contains
all sorts of things previously unsearchable through the other servers.
Requests are processed when they are received. The average processing
time is about 2 minutes, depending on the complexity of the request.
The "hits" are returned in a different format than the original seqhunt
server. This format is being applied to other distributions of the
database. It is meant to be easier to read and easier to process by
computer programs. The format contains a vertical alignment of the
sequences returned, which is more familiar to users of the book. In
all cases, the length of a line in the returned record is 80 characters
or less.
Be sure to make a note of and reference the date the request was searched.
That constitutes the "release" of the database you are working with.
REPORTING PROBLEMS
------------------
In the unlikely (huh?) event that the server crashes, you will get back
two consecutive lines like:
    Processed: Sunday, July 31, 1994: 12:27:53 PM CDT
    Server finished: Sunday, July 31, 1994: 12:28:35 PM CDT
When there is nothing in between these lines, then something ran amok.
If you can, please send me the request you sent in or the time you sent
the request and we will begin finding the bug. Even if there are no
matches with your request, you will get back something that says no
matches were found.
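If you process replies with a program, the crash signature described
above is easy to check for. This is only a sketch: it assumes the two
lines always start with "Processed:" and "Server finished:" as in the
sample, and it treats blank lines between them as "nothing".

```python
def server_crashed(reply):
    """Return True if nothing sits between the two timestamp lines."""
    # Drop blank lines so that only real content counts as output.
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    for first, second in zip(lines, lines[1:]):
        if (first.startswith("Processed:")
                and second.startswith("Server finished:")):
            return True
    return False


sample = (
    "Processed: Sunday, July 31, 1994: 12:27:53 PM CDT\n"
    "Server finished: Sunday, July 31, 1994: 12:28:35 PM CDT\n"
)
print(server_crashed(sample))
```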
FORMAT OF A QUERY
-----------------
A query consists of two parts. The two parts are separated by the word
"Begin" which is required in all requests. Before the word begin, the
things you can put in are:
    PAREDOWN      This causes the output of the pattern matching
                  programs to return only the stretch of sequence that
                  matched your pattern. The default is to return
                  the entire sequence.
    MAXDOC n      Specify the maximum number of hits that will be
                  returned. The default is 20, the maximum is 75.
    STARTDOC n    Specify the starting document to return. For
                  queries which have many hits, you may want to
                  return only 10 or so at a time. To get the first
                  10, put in MAXDOC 10. To get the next 10,
                  resubmit the search and put in STARTDOC 11.
    PSAA          Output PostScript instead of ASCII text.
                  PSAA is used for amino acids. *
    PSNUC         Output PostScript instead of ASCII text.
                  PSNUC is used for nucleotides. *
                  * Of course, only one of these can be used at a time.
    VARIB         Calculate the variability of the group of sequences
                  you hit. The sequences are broken down into logical
                  groups of alignment, the distribution and variability
                  are computed, and the plot is returned.
    DIST          Return a PostScript file containing the distribution
                  table for the group(s) of sequences you find. Probably
                  useful in conjunction with the VARIB option.
    RAWDATA       Return unaligned sequences of nucleotides and amino
                  acids for entries matching your specified restriction.
                  Note: This feature won't work with the pattern
                  matching features, #AM and #NM.
PAREDOWN, STARTDOC, MAXDOC, PSAA, PSNUC, VARIB, DIST, and RAWDATA are
optional; they do NOT have to be included.
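To walk through a long hit list a page at a time, you resubmit the same
search with a different STARTDOC. Here is a rough sketch in Python that
writes out the option lines for each request. The function name is made
up, and it assumes you keep sending MAXDOC along with STARTDOC on the
later requests.

```python
def page_headers(total_wanted, page_size=10):
    """Yield the option lines for successive requests, one page each."""
    if not 1 <= page_size <= 75:
        raise ValueError("MAXDOC must be between 1 and 75")
    for start in range(1, total_wanted + 1, page_size):
        lines = [f"MAXDOC {page_size}"]
        if start > 1:
            # The first page needs no STARTDOC.
            lines.append(f"STARTDOC {start}")
        yield "\n".join(lines)


for header in page_headers(30):
    print(header)
    print("Begin")
    print("...your restriction here...")
    print()
```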
After the word Begin, one or two things can be specified. These things
deal with restricting the search and doing pattern matches. The first
thing after the word begin should be the restriction. The words in the
restriction are searched as regular expressions. Phospho would match
phosphocholine, phosphoboo, phosphobobo, obophosphoagogo. The regular
expression syntax can be used in these patterns. The symbol -, though,
is reserved for unary NOT. The key words "and" and "or" must be
included in the restriction; "mouse kappa" does not mean "mouse and
kappa". Double quotes ("") can be used to enclose phrases you want
to look for. For example, someone wanted to find J Chains. Well,
J AND Chain is not going to be helpful. "J Chain" will return any
record containing that phrase. Of course, a dot (.) would work just
as well; J.Chains would return the results of "J Chains" along with
anything else that dot could represent. Some people are more comfortable
with quotes, so they are included.
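If you want to try a pattern before mailing it in, Python's re module
behaves much like the description above. This is only an approximation:
the server's own regular expression dialect is not spelled out here,
and case-insensitive matching is an assumption based on the Phospho
example.

```python
import re


def word_matches(pattern, text):
    """True if the pattern occurs anywhere in the text."""
    # Assumption: matching is unanchored and ignores case.
    return re.search(pattern, text, re.IGNORECASE) is not None


for candidate in ["phosphocholine", "obophosphoagogo", "phosphate"]:
    print(candidate, word_matches("Phospho", candidate))

# The dot stands for any single character, so it also matches a space.
print(word_matches("J.Chain", "human J Chain precursor"))
```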
When you are looking for a group of sequences of a specific species, you
can enter @species to tell the server to look in a special list of sequences
and their species. Without the @, the word, like human, will be searched
through all the notes and specificities and everything else. Some people's
names have RAT somewhere in them, and without the @RAT, those sequences
would come up as matches! (sorry to people with rat in their names). NOT
including the "@" can be useful, like when looking for mouse anti-HUMAN
antibodies. @mouse and human would be good for this type of search. To
look for all species other than the one specified, use @-species.
Here are some examples:
    mouse kappa light chains with phosphocholine specificity
The restriction would be:
    @mouse and kappa and phosphocholine
More complicated requests can be made:
    mouse or human phosphocholine antibodies
The restriction would be:
    (mouse or human) and phosphocholine
What about rat and rabbit antibodies, but no kappa's?
    (rat or rabbit) and -kappa
The -kappa means NOT kappa. Note that the -'s must be distributed:
-(rat or rabbit) will not work, but (-rat and -rabbit) will work.
More examples will be described below.
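To see how the @ and - prefixes change what a word matches, here is a
toy model in Python. It is not the server's code. The record layout (a
"species" field plus the full "text" of the entry) and the four records
are made up for illustration, and the and/or logic is simply written
with Python's own operators.

```python
import re


def term(word, record):
    """Evaluate one restriction word, honoring the @ and - prefixes."""
    species_only = word.startswith("@")
    if species_only:
        word = word[1:]
    negate = word.startswith("-")
    if negate:
        word = word[1:]
    # @word looks only at the species list; a bare word looks everywhere.
    haystack = record["species"] if species_only else record["text"]
    found = re.search(word, haystack, re.IGNORECASE) is not None
    return found != negate


records = [  # hypothetical entries
    {"species": "mouse", "text": "mouse kappa light chain, anti-phosphocholine"},
    {"species": "mouse", "text": "mouse lambda light chain, anti-human IgG"},
    {"species": "rat", "text": "rat kappa light chain"},
    {"species": "human", "text": "human heavy chain, sequenced by Pratt"},
]

# rat without the @ also picks up the author name Pratt.
print([i for i, r in enumerate(records) if term("rat", r)])
print([i for i, r in enumerate(records) if term("@rat", r)])

# @mouse and human: mouse sequences that mention human somewhere.
print([i for i, r in enumerate(records)
       if term("@mouse", r) and term("human", r)])
```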
After the restriction, which MUST occur all on one line, the pattern
matching tools are specified if you want to. They are:
    #NM n                      #NM is nucleotide match with n
    actgactagctacgtactgacgt    allowable mismatches
    #AM n                      #AM is amino acid match with n
    AKSKSLWKSKALAKDKELWS       allowable mismatches
Note the sequence pattern goes on the line IMMEDIATELY following the
#NM or #AM line. This is for a reason. If you can only put 80
characters on a line, you can split the sequence across multiple lines.
This is set up so that the search pattern is the LAST thing in the
request, so everything after the #AM or #NM line is search pattern.
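Putting the pieces together, a request body can be assembled by a small
helper. The sketch below is in Python; the function name and its
arguments are made up, and it only enforces the rules stated in this
file (PSAA and PSNUC are exclusive, RAWDATA does not go with #AM or
#NM, the restriction sits on one line, the pattern comes last).

```python
def build_query(restriction=None, options=(), match=None):
    """Assemble a request body: options, Begin, restriction, pattern.

    match is an optional tuple (kind, mismatches, sequence), where kind
    is "AM" or "NM".
    """
    opts = list(options)
    names = {opt.split()[0] for opt in opts}
    if {"PSAA", "PSNUC"} <= names:
        raise ValueError("only one of PSAA and PSNUC can be used")
    if "RAWDATA" in names and match is not None:
        raise ValueError("RAWDATA does not work with #AM or #NM")
    lines = opts + ["Begin"]
    if restriction:
        if "\n" in restriction:
            raise ValueError("the restriction must be on one line")
        lines.append(restriction)
    if match is not None:
        kind, mismatches, sequence = match
        if kind not in ("AM", "NM"):
            raise ValueError("kind must be 'AM' or 'NM'")
        lines.append(f"#{kind} {mismatches}")
        # Split a long pattern into lines of at most 80 characters.
        lines.extend(sequence[i:i + 80]
                     for i in range(0, len(sequence), 80))
    return "\n".join(lines) + "\n"


print(build_query("human", ["MAXDOC 5"], ("AM", 0, "FYME")))
```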
The #AM and #NM are applied as the second part of an AND. For example:
    mouse kappa's that have cagtacgtcagtcagtca with 3 allowable mismatches

    Begin
    mouse and kappa
    #NM 3
    cagtacgtcagtcagtca
That's all for the request. Mouse then Kappa are ANDed, then another
AND occurs with the pattern match.
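For a feel of what "n allowable mismatches" means, here is a small
sketch in Python. It assumes a mismatch is a plain substitution (no
insertions or deletions) and slides the pattern along the sequence one
position at a time. The server's actual algorithm is not described in
this file, and the sequences below are made up.

```python
def find_with_mismatches(sequence, pattern, max_mismatches):
    """Yield (start, stretch, mismatches) for windows within the limit."""
    width = len(pattern)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Count positions where the window and the pattern differ.
        mismatches = sum(a != b
                         for a, b in zip(window.upper(), pattern.upper()))
        if mismatches <= max_mismatches:
            yield start, window, mismatches


target = "ttcagtacgtgagtcagtcaaa"   # made-up nucleotide sequence
for hit in find_with_mismatches(target, "cagtacgtcagtcagtca", 3):
    print(hit)
```

The stretch in each tuple is only the part of the sequence that matched,
which is roughly what the PAREDOWN option asks the server to return.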
For the amino acid and nucleotide matches, you do not have to specify
a restriction. The default is a global search over the entire database.
Now, this might sound like the most prudent thing to do, but remember
that restricting things lowers the number of bases/amino acids the
program has to run through to find your matches. Unfortunately, the
machine the server runs on is also a machine in heavy use by us. So,
if at all possible, please restrict the amino acid and nucleotide
searches if you can. There are plenty of legitimate reasons to globally
search everything, but if you are only interested in mouse kappa's, just
include the line:
    mouse and kappa
and immediately you eliminate about 15000 sequences that have to be
searched through.
Examples:
    Begin
    rabbit and minigene
This search returned 72 hits applying the AND. There are rabbit
minigenes in the output, but also instances of somebody named
RABBITT who sequenced a chromosomal aberration which brought a Vh and
TCR JA MINIGENE close together. As you can see, the hits are not
always what you want. That tells you to be very careful about how
you format your request.
    MAXDOC 5
    Begin
    human
    #AM 0
    FYME
This search returned 4 matches, one of which is interesting for you
FYME buffs.
[...]
79 73 THR
80 74 SER
81 75 THR
82 76 SER
83 77 ILE
84 78 PHE PHE
85 79 TYR TYR
86 80 MET MET
87 81 GLU GLU
88 82 LEU
89 82A SER
90 82B ARG
91 82C LEU
92 83 ARG
[...]
That doesn't belong there! FYME is usually found in CDR1, but this time
we find it in a back loop.
To get a feeling for what the output is like, send in a request like this:
    PSAA
    VARIB
    Begin
    chicken
This request sends back some sequences which contain chicken somewhere
in the document. It returns the sequences in PostScript format and computes
the variability of the sequence groups found. To be more daring, substitute
your own favorite species. Beware: PSAA and PSNUC will not print groups of
sequences having more than 250 members (e.g., human Ig heavy chains; you
don't want to deal with that kind of output!). The variability, however,
will be calculated and plotted for a group of any size.
To return ONLY chicken sequences, you would use:
    PSAA
    VARIB
    Begin
    @chicken
The difference is the @ before the species.
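This file does not say how VARIB computes variability. A common
definition, due to Wu and Kabat, is the number of different amino acids
seen at a position divided by the frequency of the most common one. The
Python sketch below assumes that definition; the server's own
computation and its handling of gaps may differ, and the column of
residues is made up.

```python
from collections import Counter


def variability(column):
    """Wu-Kabat style variability for one aligned position."""
    counts = Counter(column)
    # Fraction of sequences carrying the most common residue.
    top_fraction = counts.most_common(1)[0][1] / len(column)
    return len(counts) / top_fraction


# Four sequences, three with S and one with T at this position.
print(variability("SSST"))
# A fully conserved position gives the minimum value of 1.
print(variability("AAAA"))
```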
---------
Please let me know what you like/don't like about the server.
George Johnson george at immuno.bme.nwu.edu