The Kabat database of Sequences of Proteins of Immunological Interest
Introduction
==========================================================================
This file contains information about the kabat directory in the NCBI ftp
repository, as well as its subdirectories:
/Rel5.0
/ps
/dump
/otherdata
At the bottom of this README is an e-mail address for problems and
suggestions.
Please please please:
If you are confused by the file formats or do not understand something,
please download a sample of each file and take a look at it, print it,
or whatever. They are really easier to understand after looking at them
than by just reading a description of them.
Thank You,
George Johnson
Family business
==========================================================================
All in the Family...
We have had a couple of questions about our 'family' designation for the
dump files and Postscript files. These families do not and are not meant
to correspond to families cited in the literature. They are purely
something generated at our end for the purpose of combining sequences of
high AMINO ACID homology within the variable region together to help us
and others locate sequences that seem to belong with each other more than
with other sequences.
Each family is composed of sequences that differ from one another by 12
amino acids or less. These amino acid differences do not take into
consideration the codons that generated them. It is a division based on
amino acid sequence only. (A glance over the codon sequences, though,
indicates that the codons are quite similar too.)
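
To make the 12-residue rule concrete, here is a minimal Python sketch that
counts the positions at which two aligned amino acid sequences differ. It
is an illustration only, not the procedure used to build the families: the
function names are made up for this example, and counting a gap against a
residue as one difference is an assumption.

```python
def count_differences(seq_a, seq_b):
    """Count aligned positions where two residue lists disagree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def same_family(seq_a, seq_b, threshold=12):
    """True if the sequences differ at `threshold` positions or fewer."""
    return count_differences(seq_a, seq_b) <= threshold


# Two short, made-up aligned fragments (three-letter residue codes).
first = ["GLU", "VAL", "GLN", "LEU", "GLN"]
second = ["GLU", "VAL", "LYS", "LEU", "GLN"]
print(count_differences(first, second))  # 1
print(same_family(first, second))        # True
```

Note that this compares amino acids only, as described above; the codons
behind each residue play no part in the count.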
Each family table or file has a miscellaneous table associated with it.
For example, HUMAN HEAVY CHAINS FAMILY I has a file associated with it
called HUMAN HEAVY CHAINS FAMILY I MISCELLANEOUS. This table contains
sequences that are not complete through the V-region but that do share a
great similarity with sequences in HUMAN HEAVY CHAINS FAMILY I. Because
they are incomplete, these sequences cannot be unambiguously assigned to
family I. Also, there are two large tables associated with each CLASS of
sequences, a miscellaneous unknown table and a miscellaneous fragment
table. The unknown table contains sequences which do not fit into a
family and are mostly complete. The fragment table contains sequences
which are incomplete throughout most of the V-region.
Please keep in mind that the family designation is for our purposes and
for purposes of locating similar sequences. There is no relationship
between our families and everyone else's families that we know of.
Dump files
==========================================================================
The directory /dump contains dump files generated from the Kabat Database
of Sequences of Proteins of Immunological Interest.
These files will be regenerated weekly to reflect the new additions and
corrections to the database.
These files will be present while the database is being converted into
ASN.1 format, as an intermediate between the fifth edition of Sequences
of Proteins of Immunological Interest and the ASN.1 files.
Since the close-of-data for printing the Fifth Edition (April 1991), the
database has grown enormously. As of October 1992, the number of amino
acids has increased by 50% for immunoglobulin, while the number of
codons has increased by 100% for immunoglobulin. The other categories
of sequences have grown at this rate or higher. Because of this, the
BBN-generated files for the Fifth Edition are severely out of date.
Unfortunately, the dump files are not in GenBank flatfile format, because
of the massive amount of time required to convert the Kabat database table
format to GenBank flatfile format. To give interested workers the most
current information possible, we have decided to present the raw data
in dump format. Please read below for a description of the dump
format.
File Naming Convention
==========================================================================
The dump filenames describe the contents of the dump files.
Here is an example:
    various.con.hc
The first field shows the species (various species).
The second field shows the sequence type (constant region).
The third field shows the type of constant region (heavy chain).
Another example:
    mouse.ig.hc
Mouse immunoglobulin heavy chains.
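
If you want to sort dump files by name in a script, the convention can be
sketched in Python as follows. This assumes every dump filename has exactly
three dot-separated fields, which is true of the two examples above but is
not guaranteed here for every file; the function name and dictionary keys
are hypothetical.

```python
def parse_dump_filename(name):
    """Split a dump filename such as 'mouse.ig.hc' into its three fields."""
    parts = name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three dot-separated fields, got {name!r}")
    species, sequence_type, subtype = parts
    return {
        "species": species,              # e.g. 'mouse' or 'various'
        "sequence_type": sequence_type,  # e.g. 'ig' or 'con'
        "subtype": subtype,              # e.g. 'hc' for heavy chain
    }


print(parse_dump_filename("various.con.hc"))
# {'species': 'various', 'sequence_type': 'con', 'subtype': 'hc'}
```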
Dump File Format
==========================================================================
Here is an example of one of the entries you would find in the mouse.ig.hc
dump file:
AA TABLE : MOUSE HEAVY CHAIN FAMILY I
NUC TABLE : CODONS OF MOUSE HEAVY CHAINS FAMILY I
AMINO NAME: TF5-139'CL
CODON NAME: TF5-139
REFERENCE : RILEY,S.C.,CONNORS,S.J.,KLINMAN,N.R. & OGATA,R.T. (1986) PROC.NAT.ACAD.SCI.USA,83,2589-2593. (CHECKED BY AUTHOR 08/19/87)
AB SPECIFI:
SPECIES : MOUSE
CLASS : IGA-KAPPA
STRAIN :
SOURCE : NEONATAL SPLEEN CELL HYBRIDOMA
INSERTSAA :
INSERTSNUC:
NOTES AA : FROM BALB/c NEONATAL SPLEEN CELL HYBRIDOMA.
NOTES NUC :
KABAT NUM : 0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|35A|35B|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|52A|52B|52C|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|82A|82B|82C|83|84|85|86|87|88|89|90|91|92|93|94|95|96|97|98|99|100|100A|100B|100C|100D|100E|100F|100G|100H|100I|100J|100K|101|102|103|104|105|106|107|108|109|110|111|112|113
AA SEQUEN : ---|GLU|VAL|GLN|LEU|GLN|GLU|SER|GLY|PRO|SER|LEU|VAL|LYS|PRO|SER|GLN|THR|LEU|SER|LEU|THR|CYS|SER|VAL|THR|GLY|ASP|SER|ILE|THR|SER|GLY|TYR|TRP|ASN|---|---|TRP|ILE|ARG|LYS|PHE|PRO|GLY|ASN|LYS|LEU|GLU|TYR|MET|GLY|TYR|ILE|SER|---|---|---|TYR|SER|GLY|SER|THR|TYR|TYR|ASN|PRO|SER|LEU|LYS|SER|ARG|ILE|SER|ILE|THR|ARG|ASP|THR|SER|LYS|ASN|GLN|TYR|TYR|LEU|GLN|LEU|ASN|SER|VAL|THR|THR|GLU|ASP|THR|ALA|THR|TYR|TYR|CYS|ALA|ARG|TRP|ASP|VAL|---|---|---|---|---|---|---|---|---|---|---|TRP|TYR|PHE|ASP|VAL|TRP|GLY|ALA|GLY|THR|THR|VAL|THR|VAL|SER|SER
NUC SEQ : ---|gag|gtg|cag|ctt|cag|gag|tca|gga|cct|agc|ctc|gtg|aaa|cct|tct|cag|act|ctg|tcc|ctc|acc|tgt|tct|gtc|act|ggc|gac|tcc|atc|acc|agt|ggt|tac|tgg|aac|---|---|tgg|atc|cgg|aaa|ttc|cca|ggg|aat|aaa|ctt|gag|tac|atg|ggg|tac|ata|agc|---|---|---|tac|agt|ggt|agc|act|tac|tac|aat|cca|tct|ctc|aaa|agt|cga|atc|tcc|atc|act|cga|gac|aca|tcc|aag|aac|cag|tac|tac|ctg|cag|ttg|aat|tct|gtg|act|act|gag|gac|aca|gcc|aca|tat|tac|tgt|gca|aga|tgg|gac|gtc|---|---|---|---|---|---|---|---|---|---|---|tgg|tac|ttc|gat|gtc|tgg|ggc|gca|ggg|acc|acg|gtc|acc|gtc|tcc|tca
//
Some things to note:
1. The lines are variable length, ended by a line feed (\n).
2. Some fields are empty; the field name is followed only by a line feed.
3. Each field name is the same length (13 characters).
4. The sequences are aligned.
5. Each codon or amino acid is separated by a |. This is useful if you
   have a procedure that can read in text and make an array out of it
   using a delimiter like the | to mark each element.
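
As a rough illustration of point 5, the Python sketch below reads dump text
into one dictionary per entry and splits a sequence field on the |
character. It is a minimal sketch under some assumptions: the field name is
taken to be everything before the first colon on a line (rather than a
fixed 13 characters), a line holding only // is taken to end an entry, and
lines without a colon are simply skipped. The function names are made up
for this example.

```python
def parse_entries(text):
    """Yield one dict per dump entry, mapping field name to field value."""
    entry = {}
    for line in text.split("\n"):
        if line.strip() == "//":      # assumed end-of-entry marker
            if entry:
                yield entry
            entry = {}
            continue
        name, sep, value = line.partition(":")  # split at the first colon
        if not sep:
            continue                  # no colon: skipped in this sketch
        entry[name.strip()] = value.strip()
    if entry:                         # last entry without a closing //
        yield entry


def split_sequence(value):
    """Turn 'GLU|VAL|GLN' into ['GLU', 'VAL', 'GLN']; empty field -> []."""
    return value.split("|") if value else []


sample = (
    "AMINO NAME: TF5-139'CL\n"
    "STRAIN :\n"
    "KABAT NUM : 0|1|2|3\n"
    "AA SEQUEN : ---|GLU|VAL|GLN\n"
    "//\n"
)
for entry in parse_entries(sample):
    print(entry["AMINO NAME"], split_sequence(entry["AA SEQUEN"]))
# TF5-139'CL ['---', 'GLU', 'VAL', 'GLN']
```

Splitting at the first colon only keeps any later colons inside a value,
such as one in a reference, from being cut.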
Some descriptions
AA TABLE--    A simple description of where the sequence came from in
              our database. Mouse heavy chains family I means that
              this sequence is a mouse immunoglobulin heavy chain
              which belongs to family I. A family, by our definition,
              is a collection of sequences which differ from one another
              by less than twelve amino acid residues.
NUC TABLE--   Nucleotide sequence table name (see AA TABLE)
AMINO NAME-   The amino acid sequence name
CODON NAME-   The nucleotide sequence name
REFERENCE--   The reference of the paper(s) that these sequences came
              from.
SPECIES,
CLASS,
NOTES --      Annotations
INSERTS AA,
INSERTS NUC-  For alignment, sometimes codons and amino acids must be
              removed from the sequence. When this is done, a #
              sign is placed in the sequence where the removal occurred.
              The sequence that was removed is placed in these rows.
KABAT NUM--   Kabat's numbering system.
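
Because the sequences are aligned, each entry in KABAT NUM lines up with
the entry at the same place in AA SEQUEN. A minimal Python sketch of that
pairing follows. It assumes that --- marks a position with no residue, as
the example entry suggests, and it does not handle the # sign used for
inserts; the function name is hypothetical.

```python
def number_residues(kabat_num, aa_sequen, gap="---"):
    """Map each Kabat position to its residue, leaving out gap positions."""
    positions = kabat_num.split("|")
    residues = aa_sequen.split("|")
    if len(positions) != len(residues):
        raise ValueError("KABAT NUM and AA SEQUEN have different lengths")
    return {pos: res for pos, res in zip(positions, residues) if res != gap}


# Positions 34 through 36 of the example entry above.
numbered = number_residues("34|35|35A|35B|36", "TRP|ASN|---|---|TRP")
print(numbered)  # {'34': 'TRP', '35': 'ASN', '36': 'TRP'}
```

The same pairing works for NUC SEQ, since codons are aligned to the same
numbering.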
The format of the dump is loose. Some of the different types of
sequences have different annotation fields. All entries have
AA TABLE, NUC TABLE, AMINO NAME, CODON NAME, NOTES AA, NOTES NUC,
KABAT NUM, NUC SEQ, and AA SEQUEN.
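
Since the format is loose, it can help to check that an entry carries the
fields listed above before relying on it. Here is a small Python sketch,
assuming an entry has already been read into a dictionary keyed by field
name; that representation and the names below are hypothetical.

```python
# Fields that every entry is stated to have.
REQUIRED_FIELDS = (
    "AA TABLE", "NUC TABLE", "AMINO NAME", "CODON NAME",
    "NOTES AA", "NOTES NUC", "KABAT NUM", "NUC SEQ", "AA SEQUEN",
)


def missing_fields(entry):
    """Return the required field names that are absent from `entry`."""
    return [name for name in REQUIRED_FIELDS if name not in entry]


entry = {"AA TABLE": "MOUSE HEAVY CHAIN FAMILY I", "AMINO NAME": "TF5-139'CL"}
print(missing_fields(entry))
```

A field that is present but empty still counts as present here, since
empty fields are a normal part of the format.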
Rel5.0
==========================================================================
The Rel5.0 directory contains the 5th edition of Sequences of Proteins of
Immunological Interest. These files reflect the database as of April
1991 and were generated by BBN. The nucleotide data is in GenBank
flatfile format. The amino acid data is in PRF format. There are
separate alignment files. Please see the documents README and kabat.doc
for more information.
This format is unfortunately no longer supported by the Kabat database
project.
PostScript
==========================================================================
The directory /ps contains PostScript files that reflect all the sequence
data in the Kabat database. These files may be printed or viewed on
equipment supporting the PostScript format. The files, when printed,
resemble the printed version of the 5th edition of Sequences of Proteins
of Immunological Interest. These files, though, are up to date and will
be kept up to date as needed. For those familiar with the printed version
of the book, a few alterations in the reference portion of each amino
acid file will be seen.
A couple of noticeable heading alterations
CORRESPONDING CODON: the name of the codon sequence the
                     translated amino acid sequence was
                     derived from.
CORRESPONDING MATURE PROTEIN: the name of the amino acid
                              sequence a signal sequence
                              is connected to. For signal
                              sequences only.
Some comments on the files
These files are set up for printing on LEGAL SIZED PAPER. If printed on
letter-sized paper, the printout will be truncated.
Because we want to generate the PostScript files as fast as possible, most
of the analysis and statistics found in the fifth edition are not present
in the PostScript files here.
The immunoglobulin light chain, heavy chain, and T-cell receptor for
antigen sequences have been divided into families of greatest homology.
Mouse heavy chain subgroups, for example, are now reflected in 55 smaller
files.
The immunoglobulin, MHC, and TCR sequences have been sorted based on the
protein found in column 1. The differences found between all the other
sequences and the one in column 1 are highlighted in bold.
How the files are named (examples)
Each file is identified by its directory path together with its filename.
    /ps/ig/mouse/hc/i.cod : mouse heavy chains family I codons
    /ps/ig/mouse/hc/i.aa  : mouse heavy chains family I amino acids
    /ps/d/mouse/mouse     : d-minigenes of mouse (no .aa or .cod;
                            only one file)
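
For scripted use, the naming scheme can be taken apart as in the Python
sketch below. It is based only on the three example paths above; the
function name, the dictionary keys, and the treatment of a missing
extension are assumptions.

```python
import posixpath

# Extensions seen in the examples; a file with no extension maps to None.
CONTENT_BY_EXTENSION = {".cod": "codons", ".aa": "amino acids"}


def split_ps_path(path):
    """Break a /ps path into its directories, base name, and content type."""
    parts = path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "ps":
        raise ValueError(f"not a /ps path: {path!r}")
    stem, extension = posixpath.splitext(parts[-1])
    return {
        "directories": parts[1:-1],  # e.g. ['ig', 'mouse', 'hc']
        "name": stem,                # e.g. 'i' for family I
        "content": CONTENT_BY_EXTENSION.get(extension),
    }


print(split_ps_path("/ps/ig/mouse/hc/i.cod"))
# {'directories': ['ig', 'mouse', 'hc'], 'name': 'i', 'content': 'codons'}
```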
Other Data
==========================================================================
The directory /otherdata contains other information prepared from the database,
like collections of sequences or parts of sequences used in some analysis
we did. Read the README to find out what is stored in this directory.
Problems and Suggestions
==========================================================================
If you have problems with these files or suggestions, please e-mail the
project at:
George Johnson george at immuno.esam.nwu.edu
T.T. Wu tt at immuno.esam.nwu.edu
Other Resources
==========================================================================
Dan Jacobson has put the dump files on his gopher server at
merlot.welch.jhu.edu. You can gopher there and search the files with
keywords and boolean constructs. Here is information about that from
the "about this gopher" text at the site.
>With a gopher client, you can reach here through
>
>    % gopher merlot.welch.jhu.edu
>
>If you have a gopher server, you can add a tunnel to this one
>with the following link:
>
>    Name=Computational Biology (Johns Hopkins University)
>    Host=merlot.welch.jhu.edu
>    Port=70
>    Type=1
>    Path=/
>
>-------------------
>
>Gopher hole builder:
>Dan Jacobson
>danj at welchgate.welch.jhu.edu
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++