Kabat database Seqhunt2 server

George Johnson george at immuno.bme.nwu.edu
Sat Mar 18 19:01:31 EST 1995


+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+                                                                       +
+               The Kabat Database of Sequences of Proteins             +
+                         of Immunological Interest                     +
+                                                                       +
+              For help, questions or comments please write:            +
+                                                                       +
+              George Johnson      george at immuno.bme.nwu.edu            +
+              Tai Te Wu               tt at immuno.bme.nwu.edu            +
+                                                                       +
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



NEW ADDITIONS
-------------

RAWDATA

This option, placed before the Begin line, returns raw, unaligned
data for the entries matching the restriction you send in.  It does
not work with #NM and #AM pattern matching (that would sort of
defeat the purpose).  It is for people who already know which sequences
they want and plan to feed the matches into some other program, such as
an alignment program, for presentation.  THIS FEATURE IS
NOT FOR INDIVIDUALS WANTING TO DOWNLOAD THE ENTIRE DAMN DATABASE THROUGH
THE MAIL.  For those users, please direct yourselves toward the archive
site or talk to us.

GENERAL STUFF
-------------

A server is available at seqhunt2 at immuno.bme.nwu.edu.  Sending mail
to this address with the single word "help" in the message body will 
return this file.

This server is an improved version of the seqhunt server.

This server allows the e-mail user to interface with the complete 
database.  All sequence classes and all annotations are searchable and
returnable.  The query format is a simplification of the rather 
restricted format of the original seqhunt server.  Briefly, the server
allows you to make and/or/not constructed restrictions and allows 
nucleotide and amino acid pattern matching with differences allowed.
The dataset searched is a raw archive of the database, and thus contains
all sorts of things previously unsearchable through the other servers.

Requests are processed when they are received.  The average processing
time is about 2 minutes, depending on the complexity of the request.

The "hits" are returned in a different format than the original seqhunt
server.  This format is being applied to other distributions of the
database.  It is meant to be easier to read and easier to process by
computer programs.  The format contains a vertical alignment of the 
sequences returned, which is more familiar to users of the book.  In
all cases, the length of a line in the returned record is 80 characters
or less.

Be sure to note, and cite, the date on which your request was processed.
That date identifies the "release" of the database you are working with.

REPORTING PROBLEMS
------------------

In the unlikely (huh?) event that the server crashes, you will get back
two consecutive lines like:

Processed:  Sunday, July 31, 1994:  12:27:53 PM CDT
Server finished:  Sunday, July 31, 1994:  12:28:35 PM CDT

If there is nothing in between these lines, then something ran amok.
If you can, please send me the request you sent in, or the time you sent
it, and we will begin finding the bug.  Even if there are no
matches with your request, you will get back something that says no
matches were found.
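
If you process the replies with a program, the crash signature described
above can be checked for automatically.  The following is only a minimal
sketch:  the function name is made up for illustration, and it assumes
that blank lines between the two time stamps count as "nothing".

```python
def looks_like_crash(reply_text):
    """Return True if a 'Processed:' line is directly followed by a
    'Server finished:' line, i.e. the server sent back no results
    and no "no matches" notice in between."""
    # Assumption: blank lines do not count as content.
    lines = [ln.strip() for ln in reply_text.splitlines() if ln.strip()]
    for current, following in zip(lines, lines[1:]):
        if (current.startswith("Processed:")
                and following.startswith("Server finished:")):
            return True
    return False


crashed_reply = (
    "Processed:  Sunday, July 31, 1994:  12:27:53 PM CDT\n"
    "Server finished:  Sunday, July 31, 1994:  12:28:35 PM CDT\n"
)
print(looks_like_crash(crashed_reply))
```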


FORMAT OF A QUERY
-----------------

A query consists of two parts, separated by the word "Begin", which is
required in all requests.  Before the word Begin, you can put in any
of the following options:

PAREDOWN           this causes the output of the pattern matching
                   programs to return only the stretch of sequence that
                   matched with your pattern.  The default is to return
                   the entire sequence.

MAXDOC n           specify the maximum number of hits that will be 
                   returned.  The default is 20, the maximum is 75.

STARTDOC n         specify the starting document to return.  For 
                   queries which have many hits, you may want to
                   return only 10 or so at a time.  To get the first
                   10, put in MAXDOC 10.  To get the next 10, re-
                   submit the search and put in STARTDOC 11.

PSAA               Output PostScript instead of ASCII text.
                   PSAA is used for amino acids. *

PSNUC              Output PostScript instead of ASCII text.
                   PSNUC is used for nucleotides. *

* of course, only one of these can be used at a time.

VARIB              Calculate the variability of the group of sequences
                   you hit.  The sequences are broken down into logical
                   groups of alignment, the distribution and variability
                   is computed, and the plot is returned.

DIST               Return a PostScript file containing the distribution
                   table for the group(s) of sequences you find.  Probably
                   useful in conjunction with the VARIB option.

RAWDATA            Returns unaligned sequences of nucleotides and amino
                   acids for entries matching your specified restriction.
                   Note:  This feature won't work with the pattern 
                   matching features, #AM and #NM.

PAREDOWN, STARTDOC, MAXDOC, PSAA, PSNUC, VARIB, DIST, and RAWDATA are
optional; they do NOT have to be included.
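
As an illustration of how MAXDOC and STARTDOC fit together when paging
through many hits, here is a small sketch that works out the option
lines for a given page.  The function name is hypothetical, and it
assumes that STARTDOC counts from 1 and that MAXDOC is repeated on each
re-submission.

```python
def page_options(page, page_size=10):
    """Return the option lines (to go before Begin) for a 1-based page."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if not 1 <= page_size <= 75:
        raise ValueError("MAXDOC must be between 1 and 75")
    options = ["MAXDOC %d" % page_size]
    start = (page - 1) * page_size + 1
    if start > 1:
        # The first page needs no STARTDOC line.
        options.append("STARTDOC %d" % start)
    return options


print(page_options(1))  # ['MAXDOC 10']
print(page_options(2))  # ['MAXDOC 10', 'STARTDOC 11']
```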

After the word Begin, one or two things can be specified:  a restriction
on the search and, optionally, a pattern match.  The first thing after
the word Begin should be the restriction.  The words in the restriction
are searched as regular expressions, so Phospho would match
phosphocholine, phosphoboo, phosphobobo, and obophosphoagogo.  Regular
expression syntax can be used in these patterns.  The symbol -, though,
is reserved for unary NOT.  The key words "and" and "or" must be
included in the restriction; "mouse kappa" does not mean "mouse and
kappa".  Double quotes ("") can be used to enclose phrases you want
to look for.  For example, someone wanted to find J Chains.  Well,
J AND Chain is not going to be helpful.  "J Chain" will return any
entry containing that phrase.  Of course, a dot (.) would work just
as well; J.Chain would return the results of "J Chain" along with
anything else that the dot could represent.  Some people are more
comfortable with quotes, so they are included.

When you are looking for sequences from a specific species, you can
enter @species to tell the server to look in a special list of sequences
and their species.  Without the @, a word like human is searched for
in all the notes, specificities, and everything else.  Some people's
names have RAT somewhere in them, and without the @RAT, those sequences
would come up as matches!  (Sorry to people with rat in their names.)
Leaving out the "@" can be useful, as when looking for mouse anti-HUMAN
antibodies:  @mouse and human would be good for this type of search.  To
look for all species other than the one specified, use @-species.
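
To make the difference between a plain word and an @species term
concrete, here is a rough sketch of how a single term might be tested
against one entry.  It is not the server's code:  the record layout
(a "species" field plus a "text" field holding everything else) is
invented for illustration, and case-insensitive matching is an
assumption based on the examples in this file.

```python
import re


def matches_term(term, record):
    """Test one restriction term against a hypothetical record dict."""
    if term.startswith("@-"):
        # @-species: every species except the one named.
        return re.search(term[2:], record["species"], re.IGNORECASE) is None
    if term.startswith("@"):
        # @species: look only in the species list.
        return re.search(term[1:], record["species"], re.IGNORECASE) is not None
    # Plain word: regular expression search over all the text.
    return re.search(term, record["text"], re.IGNORECASE) is not None


record = {
    "species": "MOUSE",
    "text": "MOUSE KAPPA LIGHT CHAIN, ANTI-HUMAN, PHOSPHOCHOLINE BINDING",
}
print(matches_term("@mouse", record))   # True
print(matches_term("human", record))    # True, found in the notes
print(matches_term("@human", record))   # False, the species is mouse
print(matches_term("Phospho", record))  # True, matches phosphocholine
```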

Here are some examples:

mouse kappa light chains with phosphocholine specificity

The restriction would be:

     @mouse and kappa and phosphocholine

More complicated requests can be made:

mouse or human phosphocholine antibodies

The restriction would be:
     
     (mouse or human) and phosphocholine

What about entries that mention both rat and rabbit, but no kappa's?

     (rat and rabbit) and -kappa

The -kappa means NOT kappa.  Note that the -'s must be distributed,
-(rat and rabbit) will not work, but (-rat and -rabbit) will work.
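
For readers who want to check a restriction before mailing it, the
following sketch evaluates an and/or/not restriction against a piece of
text.  It is only an approximation under stated assumptions:  terms are
matched case-insensitively as regular expressions, "and" and "or" are
applied left to right with no precedence between them, @species terms
are not handled, and the input is well formed.  As with the server, a -
in front of a parenthesis is rejected.

```python
import re

# Parentheses, optionally negated quoted phrases, or bare words.
TOKEN = re.compile(r'\(|\)|-?"[^"]*"|[^\s()]+')


def evaluate(restriction, text):
    """Evaluate a restriction such as '(rat and rabbit) and -kappa'."""
    tokens = TOKEN.findall(restriction)
    pos = 0

    def term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = expr()
            pos += 1  # skip the closing ")"
            return value
        negate = tok.startswith("-")
        if negate:
            tok = tok[1:]
        if tok == "":
            raise ValueError("NOT (-) must be attached to a single term")
        tok = tok.strip('"')
        found = re.search(tok, text, re.IGNORECASE) is not None
        return found != negate

    def expr():
        nonlocal pos
        value = term()
        while pos < len(tokens) and tokens[pos].lower() in ("and", "or"):
            op = tokens[pos].lower()
            pos += 1
            right = term()
            value = (value and right) if op == "and" else (value or right)
        return value

    return expr()


print(evaluate("(rat and rabbit) and -kappa", "RAT ANTI-RABBIT IGG LAMBDA"))
print(evaluate("(-rat and -rabbit)", "MOUSE KAPPA"))
```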

More examples will be described below.

After the restriction, which MUST fit entirely on one line, you can
specify a pattern matching tool if you want one.  They are:

#NM n                            #NM is nucleotide match with n 
actgactagctacgtactgacgt          allowable mismatches

#AM n                            #AM is amino acid match with n 
AKSKSLWKSKALAKDKELWS             allowable mismatches


Note the sequence pattern goes on the line IMMEDIATELY following the
#NM or #AM line.  This is for a reason.  If you can only put 80
characters on a line, you can split the sequence across multiple lines.
The search pattern must be the LAST thing in the request, so
everything after the #AM or #NM line is treated as search pattern.

The #AM and #NM are applied as the second part of an AND.  For example:

mouse kappa's that have cagtacgtcagtcagtca with 3 allowable mismatches

Begin
mouse and kappa
#NM 3
cagtacgtcagtcagtca

That's all for the request.  Mouse then Kappa are ANDed, then another
AND occurs with the pattern match.
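
To give an idea of what "n allowable mismatches" means, here is a
minimal sketch of a mismatch-tolerant search.  It assumes that a
mismatch is a substitution at one position (no insertions or deletions)
and that case is ignored; the server's actual matching program may
differ.  The function name is made up for illustration.

```python
def find_with_mismatches(pattern, sequence, max_mismatches):
    """Slide the pattern along the sequence and report every window
    that differs from it in at most max_mismatches positions."""
    pattern = pattern.lower()
    sequence = sequence.lower()
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        mismatches = sum(1 for a, b in zip(pattern, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


# An exact match (0 mismatches) and a search allowing one difference.
print(find_with_mismatches("FYME", "TSTSIFYMELSRLR", 0))
print(find_with_mismatches("FYMQ", "TSTSIFYMELSRLR", 1))
```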

For the amino acid and nucleotide matches, you do not have to specify
a restriction.  The default is a global search over the entire database.
Now, this might sound like the most prudent thing to do, but remember
that a restriction lowers the number of bases/amino acids the
program has to run through to find your matches.  Unfortunately, the
machine the server runs on is also in heavy use by us.  So,
if at all possible, please restrict the amino acid and nucleotide
searches.  There are plenty of legitimate reasons to globally
search everything, but if you are only interested in mouse kappa's, just
include the line:
  mouse and kappa
and immediately you eliminate about 15000 sequences that have to be
searched through.
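
Putting the pieces of the query format together, a request body could
be assembled by a short script like the one below.  This is a sketch,
not an official client:  the function and its arguments are invented
for illustration, and it only builds the text of the message (mailing
it is left to you).

```python
def build_request(restriction=None, options=(), match=None):
    """Build the text of a request.

    options -- option lines placed before Begin, e.g. ["MAXDOC 5"]
    match   -- optional (kind, n, pattern), kind being "NM" or "AM"
    """
    lines = list(options)
    lines.append("Begin")
    if restriction:
        if "\n" in restriction:
            raise ValueError("the restriction must be on one line")
        lines.append(restriction)
    if match:
        kind, n, pattern = match
        if kind not in ("NM", "AM"):
            raise ValueError("kind must be 'NM' or 'AM'")
        if "RAWDATA" in lines:
            raise ValueError("RAWDATA does not work with #NM or #AM")
        lines.append("#%s %d" % (kind, n))
        # The pattern comes last and may be split over several lines.
        for i in range(0, len(pattern), 80):
            lines.append(pattern[i:i + 80])
    return "\n".join(lines) + "\n"


print(build_request("mouse and kappa", match=("NM", 3, "cagtacgtcagtcagtca")))
```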


Examples:

Begin
rabbit and minigene

This search returned 72 hits applying the AND.  There are rabbit
minigenes in the output, but also entries from somebody named
RABBITT who sequenced a chromosomal aberration which brought a Vh and
TCR JA MINIGENE close together.  As you can see, the hits are not
always what you want, so be very careful about how you word your
request.


MAXDOC 5
Begin
human
#AM 0
FYME

This search returned 4 matches, one of which is interesting for you
FYME buffs.

 [...]

   79    73    THR       
   80    74    SER       
   81    75    THR       
   82    76    SER       
   83    77    ILE       
   84    78    PHE  PHE  
   85    79    TYR  TYR  
   86    80    MET  MET  
   87    81    GLU  GLU  
   88    82    LEU       
   89   82A    SER       
   90   82B    ARG       
   91   82C    LEU       
   92    83    ARG       
   
 [...]

That doesn't belong there!  FYME is usually found in CDR1, but this time
we find it in a back loop.

To get a feeling for what the output is like, send in a request like this:

PSAA
VARIB
Begin
chicken

This request sends back sequences which contain chicken somewhere
in the document.  It returns the sequences in PostScript format and
computes the variability of the sequence groups found.  To be more
daring, substitute your own favorite species.  Beware:  PSAA and PSNUC
will not print groups of sequences having more than 250 members (e.g.,
human ig heavy chains; you don't want to deal with that kind of
output!).  The variability, however, will be calculated and plotted
for a group of any size.

To return ONLY chicken sequences, you would use:

PSAA
VARIB
Begin
@chicken

The difference is the @ before the species.
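
This file does not spell out how VARIB computes variability.  As a
rough illustration only, the sketch below assumes the classic Wu-Kabat
definition:  the number of different amino acids seen at a position,
divided by the frequency of the most common one.  The function name and
the treatment of gaps are assumptions, not a description of the server.

```python
from collections import Counter


def variability(column):
    """Wu-Kabat style variability for one aligned position.

    column -- the residues found at that position, one per sequence;
              empty strings and "-" are treated as gaps and skipped.
    """
    residues = [r for r in column if r not in ("", "-")]
    if not residues:
        return 0.0
    counts = Counter(residues)  # this is the distribution at the position
    most_common_freq = counts.most_common(1)[0][1] / len(residues)
    return len(counts) / most_common_freq


# An invariant position scores 1; more diverse positions score higher.
print(variability(["SER", "SER", "SER", "SER"]))
print(variability(["PHE", "PHE", "PHE", "TYR"]))
```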

---------


Please let me know what you like/don't like about the server.

George Johnson      george at immuno.bme.nwu.edu


