Searching the Kabat Database of Sequences of Proteins of Immunological
Interest with Seqhunt.
Description
===========================================================================
Seqhunt is a set of routines we use here to search and analyze the Kabat
database. Recent modifications to the Seqhunt routines have allowed us to
offer searches of the database to others by electronic mail. Most of the
database can be searched this way. Pseudogenes, D-minigenes, and
J-minigenes are currently not accessible for searches.
**IMPORTANT**
Seqhunt is NOT an alignment program. The aligned sequences in the database
were aligned by visual inspection, and the alignment is forced to follow
the Kabat numbering system. If your request finds matches, the sequences
that come back have been pre-aligned. You may use these returned aligned
sequences as SUGGESTIONS as to how you might align your sequence. For
example, alignment of the third complementarity-determining region of the
heavy chains depends on finding where the D minigene region ends and the
J minigene region begins. This is not always possible, and single base
replacements in the rearranged sequence compound the problem. So, if the
codon or amino acid does not match perfectly, where does it go, with the
D or the J? That is a problem you have to work out visually. Your results
may therefore come back with many different aligned matches, which can be
used as aids in the alignment process. Seqhunt was originally written for
this purpose.
Types of Searches
===========================================================================
Eight search types are available by electronic mail. They are as
follows:
nsa : Nucleotide String Antibody
Search for a pattern match with the desired antibody specificity
for immunoglobulins or classification for TCR.
asa : Amino Acid String Antibody
Search for a pattern match with the desired antibody specificity
for immunoglobulins or classification for TCR.
nsr : Nucleotide String Reference
Search for a pattern match with the desired reference.
asr : Amino Acid String Reference
Search for a pattern match with the desired reference.
nsn : Nucleotide String Name
Search for a pattern match with the desired name.
asn : Amino Acid String Name
Search for a pattern match with the desired name.
None of the six searches above requires an exact match of the whole
field. For example, if you enter HIV as the search pattern in an nsa
search, all sequences containing the phrase HIV in the antibody
specificity will be returned.
nm : Nucleotide match
Search for the pattern matches with the target sequence you
supply, with no more than the allowable mismatches you specify.
Both senses of the sequence you send will be searched
automatically.
am : Amino acid match
Search for the pattern matches with the target sequence you
supply, with not more than the allowable mismatches you
specify. YOUR SEQUENCE MUST BE SENT IN SINGLE LETTER CODE.
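As a rough illustration of what a match search with allowable mismatches
involves, here is a minimal Python sketch. It is not the Seqhunt code. It
assumes an ungapped, position-by-position comparison of the target against
every window of a database sequence, and it assumes that "both senses"
means the target and its reverse complement. All function names are made
up for this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string (a, c, g, t, n)."""
    pairs = {"a": "t", "t": "a", "c": "g", "g": "c", "n": "n"}
    return "".join(pairs[base] for base in reversed(seq.lower()))


def count_mismatches(target, window):
    """Count the positions at which two equal-length strings differ."""
    return sum(1 for a, b in zip(target, window) if a != b)


def find_matches(target, sequence, max_mismatches):
    """Yield (offset, mismatches) for each window within the allowed limit.

    Assumption: no gaps are introduced; each window is compared
    position by position with the target.
    """
    target, sequence = target.lower(), sequence.lower()
    for offset in range(len(sequence) - len(target) + 1):
        window = sequence[offset:offset + len(target)]
        diff = count_mismatches(target, window)
        if diff <= max_mismatches:
            yield offset, diff


def nucleotide_match(target, sequence, max_mismatches):
    """Search with the target and with its reverse complement."""
    reverse = reverse_complement(target)
    return {
        "forward": list(find_matches(target, sequence, max_mismatches)),
        "reverse": list(find_matches(reverse, sequence, max_mismatches)),
    }
```

The offsets returned here are plain 0-based string positions. Seqhunt
itself reports match positions in Kabat's numbering, which this sketch
does not attempt to reproduce.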
Restrictions
===========================================================================
To restrict a search, set the following fields to your specifications. See
the "Valid Search Field Restrictions" part of this document for the
abbreviations used.
species    human, mouse, rabbit, etc. or all
class      immunoglobulin, t-cell receptor, mhc, etc. or all
subclass   heavy chains, kappa light chains, tcr alpha, etc. or all
Besides naming a specific species, class, and subclass, you may, when
allowed (see the valid restrictions at the end), search "all" of a field.
For example, {mouse, ig, all} searches mouse immunoglobulin heavy, kappa,
and lambda chains, and {all, ig, hc} searches all immunoglobulin heavy
chains regardless of the species. Any field may be set to "all", with one
exception: you may not specify "all" for all three fields, since that
would mean searching all species, all classes, and all subclasses for a
match.
At the end of this file are the current allowable restrictions for each
field.
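The one forbidden combination can be checked before a request is mailed.
The following Python sketch is only an illustration. The function name is
hypothetical, and lower-casing the fields is an assumption made here for
convenience. It does not check the fields against the table of valid
restrictions.

```python
def check_restrictions(species, seq_class, subclass):
    """Return the three fields lower-cased, or raise ValueError.

    Only the rule that not all three fields may be "all" is enforced.
    """
    fields = tuple(f.strip().lower() for f in (species, seq_class, subclass))
    if any(not f for f in fields):
        raise ValueError("species, class, and subclass must all be filled in")
    if all(f == "all" for f in fields):
        raise ValueError('"all" may not be used for all three fields')
    return fields
```

For example, check_restrictions("mouse", "ig", "all") is accepted, while
check_restrictions("all", "all", "all") raises an error.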
Formatting A Request (IMPORTANT)
===========================================================================
To keep things running as smoothly as possible, all requests must follow
a single format. It might be a good idea to keep a copy of this for quick
reference. Any request that deviates from this format will be discarded.
You can put as many requests in a mail message as you want, as long as
each one is in the correct format.
Format of an E-mail search request of the Kabat database
--------------------------------------------------------
Form              Comment
----              -----------------------------
$Begin            begin of request
# comment         optional one line comment
E-mail address    your return e-mail address
Search type       nsa,nsr,nsn,nm,asa,asr,asn,am
Species           valid species or all
Class             valid class or all
Subclass          valid subclass or all
Mismatches        mismatches allowed for nm,am
Search Pattern    pattern to look for in search
$End              end of request
All fields in the form must be filled, except the comment field which
is optional.
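If you generate requests from a script, the form can be assembled
mechanically. The Python sketch below is one way to do it and is not part
of Seqhunt. The function name and the address you@example.org are
placeholders. The field order follows the form above, and an "X" is put
in the mismatches field for searches that do not use it.

```python
MATCH_SEARCHES = {"nm", "am"}
STRING_SEARCHES = {"nsa", "nsr", "nsn", "asa", "asr", "asn"}


def format_request(email, search_type, species, seq_class, subclass,
                   pattern, mismatches=None, comment=None):
    """Return the text of one request, from $Begin to $End."""
    if search_type not in MATCH_SEARCHES | STRING_SEARCHES:
        raise ValueError("unknown search type: " + search_type)
    if search_type in MATCH_SEARCHES:
        if mismatches is None or mismatches < 0:
            raise ValueError("nm and am need a mismatch count of 0 or more")
        mismatch_field = str(mismatches)
    else:
        # Not used by string searches, but the field must still be filled.
        mismatch_field = "X"
    lines = ["$Begin"]
    if comment:
        lines.append("# " + comment)  # optional one line comment
    lines += [email, search_type, species, seq_class, subclass,
              mismatch_field, pattern, "$End"]
    return "\n".join(lines) + "\n"
```

Calling format_request("you@example.org", "asa", "human", "ig", "all",
"HIV", comment="hiv antibodies") gives a request with "X" in the
mismatches field.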
The $ before begin and end are there for a reason! Please don't
forget to put them in. The $ before end is there so that the routine
can differentiate between 'end' being amino acids and 'end' meaning
"the end".
Here is an example of an e-mail message from someone wanting to look
for the pattern HIV in the antibody specificities of amino acid data.
The restrictions imposed are to look through only human immunoglobulins.
Since the request wants to look through all immunoglobulins (heavy
chains, kappa light chains, lambda light chains), the subclass field
will be "all". The symbol for immunoglobulin is ig. These symbols
can be found at the end of this file. Since this request is not for a
sequence matching search, the mismatches field is not required. To
keep the format though, an "X" will be put in (the point here is
to fill the field with something). Of course remember that the
mismatch field would be important if we were doing a sequence match
(the next example). In this example, the comment line is filled in
with any relevant information you want to associate with the search.
Example request of asa (amino acid string antibody search)
----------------------------------------------------------
$Begin
tt at immuno.esam.nwu.edu       return address
# hiv antibodies                optional one line comment
asa                             search type
human                           species
ig                              class
all                             subclass
X                               mismatches (not used)
HIV                             target pattern
$End
This next example is a nucleotide sequence match over the mouse ig
kappa chains only, allowing 4 mismatches. The nucleotide sequence
should be free of characters other than atcg. Dashes, periods and
spaces will be removed. You can put in n's (or some other character)
for unknown bases, but make sure the character you choose is not one
that will be removed.
Example request of nm (nucleotide match)
----------------------------------------
$Begin
tt at immuno.esam.nwu.edu       return address
nm                              search type
mouse                           species
ig                              ig class
kappa                           kappa subclass
4                               4 mismatches allowed
tggcccgctagcgcgcgatatatagcg     target pattern
$End
In the above example, the target pattern can be much longer of course.
Some mailers only allow 80 characters to be put on a line, so that is
why the target pattern is right at the end. You can put the target
pattern on as many lines as you want. The routine will read each line
and glue them together (taking out spaces, dashes and carriage returns).
You can also put the sequence on one continuous line that wraps around
if you want. Just make sure there is nothing that is not sequence
between your target sequence and the statement "$End".
For amino acid sequence searches, the sequence sent should be in
SINGLE LETTER CODE.
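The clean-up described above (gluing lines together and taking out
spaces, dashes, periods and carriage returns) can be mimicked on your own
side before sending, so you can see what the pattern will look like. This
Python sketch is an approximation with made-up function names, not the
routine Seqhunt runs.

```python
def normalize_pattern(lines):
    """Join pattern lines and drop spaces, dashes, periods and line breaks."""
    removed = set(" -.\r\n")
    return "".join(ch for line in lines for ch in line if ch not in removed)


def unexpected_bases(pattern):
    """Return the characters in a nucleotide pattern other than a, t, c, g, n."""
    return sorted(set(pattern.lower()) - set("atcgn"))
```

An empty list from unexpected_bases means the pattern contains only
a, t, c, g and n.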
Sending the Request
===========================================================================
Once the form is complete, send off the request to:
seqhunt at immuno.esam.nwu.edu
You should leave the Subject: line blank.
Processing the Request
===========================================================================
Your request will be processed when it is received, and the results will
be sent back as soon as the search is performed.
Results of Your Request
===========================================================================
Your request will come back with a header, the date processed, a summary
of the request submitted, and any matches that were found. If only the
header and request summary come back, then either no matches were found
or the format of the request was not correct. Below is the partial output
for an amino acid match search with the restrictions mouse, ig, lambda.
5 mismatches were allowed. Note that although the request was sent in
single letter code, it is converted in the output to triplet code.
Each entry is divided by ~~~~~~. Each entry has the following format.
NAME: name of sequence
SEQ : codon or amino acid sequence (with alignment information)
DIFF: mismatches found (for nuc and a.a. matching)
BEG : match beginning position (Kabat's numbering)
END : match ending position (Kabat's numbering)
ANTI: antibody specificity(s)
REF : sequence reference(s)
TAB : Kabat table sequence is located in
~~~~~~~ end of match
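Because every entry uses the same field tags, returned matches can be
read by a short script. The Python sketch below assumes that each field
starts a line with its tag followed by a colon, that long fields wrap
onto following lines, and that an entry ends at a line of tildes. The
function names are hypothetical.

```python
import re

# Field tags from the entry format; "SEQ :" and "BEG :" have a space
# before the colon.
FIELD = re.compile(r"^(NAME|SEQ|DIFF|BEG|END|ANTI|REF|TAB)\s*:\s?(.*)$")


def parse_entries(text):
    """Return a list of dicts, one per entry, keyed by field tag."""
    entries, current, last = [], {}, None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("~~~~~~"):      # end of match
            if current:
                entries.append(current)
            current, last = {}, None
            continue
        found = FIELD.match(line)
        if found:
            last = found.group(1)
            current[last] = found.group(2)
        elif last is not None and line:
            # Wrapped line: append it as is, since the wrap can fall
            # in the middle of a word.
            current[last] += line
    if current:
        entries.append(current)
    return entries


def split_triplets(seq_field):
    """Split a triplet-code SEQ field into three-character items."""
    compact = "".join(seq_field.split())
    return [compact[i:i + 3] for i in range(0, len(compact), 3)]
```

All values are kept as strings. The "---" items in a SEQ field are kept
as they are, since they carry the alignment information.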
Request Sent (with a comment)
-----------------------------
To: seqhunt at immuno.esam.nwu.edu
Subject:
$Begin
#part of a mouse lambda
tt at immuno.esam.nwu.edu
am
mouse
ig
lambda
5
QAVVTQESALTTSPGGTVILTCRSSTGAVTTSNYANWVQEKPDHLFTGLIGGTSNRAPGVPVRFSGSLIGD
KAALTITGAQTEDDAMYFCALWYSTH
$End
Results returned from the search
--------------------------------
Seqhunt results
================================================================================
Your seqhunt results are in either two or three parts. There are THREE parts
for nucleotide match (nm) and amino acid match (am); all other searches are in
TWO parts. The first two parts are the same for all searches:
Part 1: Summary of the request we received.
Part 2: Matches shown with alignment information.
And for nucleotide/amino acid matches:
Part 3: Matches shown unaligned.
Part 3 contains your search pattern on the top line followed by a listing of
the matches found (in the same order as in part two), with a "." meaning a
perfect match and anything else representing a mismatch. At the end of each
sequence in the figure, the name of the sequence is shown.
If you have any questions or comments about the matches or the Seqhunt output,
please contact either:
George Johnson george at immuno.esam.nwu.edu
Tai Te Wu tt at immuno.esam.nwu.edu
New listings of allowable restrictions can be obtained from the above address
or from ncbi.nlm.nih.gov in the directory /repository/kabat in the file
SEQHUNT_FIELDS.
A complete set of instructions can be obtained by writing to us or by
retrieving the file SEQHUNT_README also in the directory /repository/kabat.
================================================================================
DATE PROCESSED: 04/27/93
Request Received:
Search........: Amino acid sequence match
Species.......: mouse
Sequence class: ig
Subclass......: lambda
Mismatches....: 5
Your Comments.: part of a mouse lambda
Search pattern: qavvtqesalttspggtviltcrsstgavttsnyanwvqekpdhlftgliggtsnrapgvpvr
fsgsligdkaaltitgaqteddamyfcalwysth
Reverse Comp..: Not used
=======
NAME: E20'CL
SEQ : GLN ALA VAL VAL THR GLN GLU SER ALA --- LEU THR THR SER PRO GLY GLY TH
R VAL ILE LEU THR CYS ARG SER SER THR GLY ALA VAL --- --- --- THR THR
SER ASN TYR ALA ASN TRP VAL GLN GLU LYS PRO ASP HIS LEU PHE THR GLY LE
U ILE GLY GLY THR SER ASN ARG ALA PRO GLY VAL PRO VAL ARG PHE SER GLY
SER LEU ILE GLY ASP LYS ALA ALA LEU THR ILE THR GLY ALA GLN THR GLU AS
P ASP ALA MET TYR PHE CYS ALA LEU TRP TYR SER THR HIS
DIFF: 0
BEG : 1
END : 95
ANTI: ANTI-PHOSPHOCHOLINE PROTEIN, p-NITROPHENYL PHOSPHOCHOLINE
REF : CHEN,C.,STENZEL-POORE,M.P. & RITTENBERG,M.B. (1991) J.IMMUNOL.,147,235
9-2367.
TAB : mouselambdalc
~~~~~~
NAME: MOPC315
SEQ : GLN ALA VAL VAL THR GLN GLU SER ALA --- LEU THR THR SER PRO GLY GLY TH
R VAL ILE LEU THR CYS ARG SER SER THR GLY ALA VAL --- --- --- THR THR
SER ASN TYR ALA ASN TRP VAL GLN GLU LYS PRO ASP HIS LEU PHE THR GLY LE
U ILE GLY GLY THR SER ASN ARG ALA PRO GLY VAL PRO VAL ARG PHE SER GLY
SER LEU ILE GLY ASP LYS ALA ALA LEU THR ILE THR GLY ALA GLN THR GLU AS
P ASP ALA MET TYR PHE CYS ALA LEU TRP TYR SER THR HIS
DIFF: 0
BEG : 1
END : 95
ANTI: ANTI-DINITROPHENYL,TRINITROPHENYL,MENADIONE(VITAMIN K3)(BINDING CONSTA
NT=5.4X10EXP5),EPSILON-DNP-L-LYS(BINDING CONSTANT=1.0X10EXP7), 2,4-DIN
ITRONAPHTHOL(BINDING CONSTANT=2.5X10EXP6),EPSILON-DNP-AMINOCAPROATE(BI
NDING CONSTANT=6.7X10EXP6)
REF : DUGAN,E.S.,BRADSHAW,R.A.,SIMMS,E.S. & EISEN,H.N. (1973) BIOCHEMISTRY,1
2,5400-5416. (CHECKED BY AUTHOR); BURSTEIN,Y. & SCHECHTER,I. (1977) BI
OCHEM.J.,165,347-354; GAVISH,M.,ZAKUT,R.,WILCHEK,M. & GIVOL,D. (1978)
BIOCHEMISTRY,17,1345-1351. (CHECKED BY AUTHOR 07/26/79)
TAB : mouselambdalc
~~~~~~
.
.
. etc.
[ I deleted some of the matches to save space ]
Unaligned matches
================================================================================
11 matches found.
* * * * * * * *
QAVVTQESALTTSPGGTVILTCRSSTGAVTTSNYANWVQEKPDHLFTGLIGGTSNRAPGVPVRFSGSLIGDKAALTITGA
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
d....................................i..........................................
.....................................i..........................................
..............r......................i..........................................
................................................................................
................................................................................
*
QTEDDAMYFCALWYSTH
................. E20'CL
................. MOPC315
................. MA8-13
................. TEPC952
................. W230'CL
.............frn. 1-54'CL
.............frn. W108'CL
.............frn. 15-30'CL
................. 163.69'CL
................. 202.17'CL
In sequence matches with mismatches allowed, the mismatches will be shown
in lower case for amino acids (as above), and in upper case for nucleotide
searches. The aligned matches and unaligned matches are in the same order.
The first line of the unaligned matches is your target sequence. The
sequences are compared with the target and exact matches are shown as "."
Anything else is a mismatch. Sometimes, a space will be shown. This is
considered a mismatch, since that base or amino acid was not known or was
not sequenced.
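A row of the unaligned display can be turned back into a full sequence by
filling each "." with the corresponding character of the target. The
Python sketch below shows the idea. The function name is made up, and it
assumes you have already joined the pieces of a row that the display
splits across several blocks.

```python
def expand_row(target, row):
    """Return (sequence, mismatch_positions) for one unaligned row.

    Positions are 1-based offsets into the target, not Kabat numbers.
    Any character other than ".", including a space, is a mismatch.
    """
    if len(row) != len(target):
        raise ValueError("row and target must have the same length")
    sequence, mismatches = [], []
    for position, (expected, shown) in enumerate(zip(target, row), start=1):
        if shown == ".":
            sequence.append(expected)
        else:
            sequence.append(shown)
            mismatches.append(position)
    return "".join(sequence), mismatches
```

The number of mismatch positions returned should agree with the DIFF
field of the corresponding aligned entry.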
Help/Comments
===========================================================================
Requests for help and comments can be made to either Dr. Wu or myself.
george at immuno.esam.nwu.edu
tt at immuno.esam.nwu.edu
Please let us know if you receive weird messages back, such as ones
saying Memory Fault or other things. These messages mean the program
crashed while attempting to process the request.
The valid search fields table will be updated periodically as things
change and deposited at ncbi.nlm.nih.gov in the anonymous ftp directory
/repository/kabat in the file SEQHUNT_FIELDS. If you cannot find it or
don't know how to ftp, send me a message requesting one and I'll send you
a copy of the table.
Also, please be aware that we have PostScript versions of the Kabat
Database and dumps of the Kabat database in the above directory at ncbi.
These files are updated weekly. The PostScript files can be printed on
any PostScript supporting printer. For more information, poke around the
README files in /repository/kabat.
If you have a gopher client, you can gopher to ncbi.nlm.nih.gov and
change into repository kabat.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Valid Search Field Restrictions
===========================================================================
Here are the valid search fields as of 05/01/93. These fields will most
likely be around for a long time, with new additions every now and then.
Be aware though that we might get the urge to shuffle things around, so
make sure the date on this table is not too old.
Here are the abbreviations.
Class
-----
ig        immunoglobulin
tcr       T-cell receptor for antigen
mhc       Major Histocompatibility Complex Class I
iregion   Major Histocompatibility Complex Class II
con       Constant Regions excluding ig heavy chain
chv       Immunoglobulin Heavy Chain constant regions
misc      Miscellaneous proteins associated with the immune system
ss        Signal sequences of all chains except miscellaneous sequences
miscss    Miscellaneous protein signal sequences
Subclass
--------
hc        immunoglobulin heavy chains
kappa     immunoglobulin kappa chains
lambda    immunoglobulin lambda chains
alpha     T-cell receptor for antigen alpha chains
beta      T-cell receptor for antigen beta chains
gamma     T-cell receptor for antigen gamma chains
delta     T-cell receptor for antigen delta chains
a         MHC class I A-locus
b         MHC class I B-locus
c         MHC class I C-locus
d         MHC class I D-locus
k         MHC class I K-locus
dpa       MHC class II DP alpha
dpb       MHC class II DP beta
dqa       MHC class II DQ alpha
dqb       MHC class II DQ beta
dra       MHC class II DR alpha
drb       MHC class II DR beta
aa        MHC class II A alpha
ab        MHC class II A beta
ea        MHC class II E alpha
eb        MHC class II E beta
adhe      Adhesion Proteins
b2mg      Beta-2-Microglobulin
comp      Complement
jch       J-chains
tsa       T-Cell Surface Antigens
thy       Thyone
miscp     Miscellaneous proteins
When you send a request to Seqhunt, the order for the restrictions is:
species
class
subclass
For ease in locating allowable restrictions, this listing is in a
different order. Make sure you put the restrictions in order though
when you send the request.
Class     Subclass    Species
-----     --------    -------
ig        hc          mouse
                      human
                      cat
                      chicken
                      dog
                      frog
                      gopher
                      rabbit
                      rat
                      shark
                      various
ig        kappa       mouse
                      human
                      rabbit
                      rat
                      various
ig        lambda      human
                      mouse
                      chicken
                      horse
                      rabbit
                      rat
                      sheep
                      various
tcr       alpha       human
                      mouse
                      bovine
                      rabbit
                      rat
                      sheep
                      various
tcr       beta        human
                      mouse
                      bovine
                      chicken
                      rabbit
                      rat
                      various
tcr       delta       human
                      mouse
                      rat
                      sheep
                      various
tcr       gamma       human
                      mouse
                      rat
                      sheep
                      various
mhc       a           human
mhc       b           human
mhc       c           human
mhc       various     human
mhc       d           mouse
mhc       k           mouse
mhc       various     mouse
mhc       various     various
iregion   dpa         human
iregion   dpb         human
iregion   dqa         human
iregion   dqb         human
iregion   dra         human
iregion   drb         human
iregion   various *   human *
iregion   aa          mouse
iregion   ab          mouse
iregion   ea          mouse
iregion   eb          mouse
iregion   various *   various *
* includes both alpha and beta chains
con       alpha       various
con       beta        various
con       gamma       various
con       delta       various
con       kappa       various
con       lambda      various
chv       hc          various
misc      adhe        various
misc      b2mg        various
misc      comp        human
misc various
misc      jch         various
misc      tsa         various
misc      thy         various
misc      miscp       human
                      mouse
                      various
ss        hc          human
                      mouse
                      various
ss        kappa       human
                      mouse
                      various
ss        lambda      human
                      mouse
                      various
ss        alpha       human
                      mouse
                      various
ss        beta        human
                      mouse
                      various
ss        gamma       human
                      mouse
                      various
ss        delta       human
                      mouse
                      various
miscss    mhc         various
miscss    iregion     various
miscss    adhe        various
miscss    b2mg        various
miscss    comp        various
miscss    tsa         various
miscss    miscp       various