Kabat database Readme
geojohn at casbah.acns.nwu.edu
Tue Sep 28 10:50:47 EST 1993
The Kabat database of Sequences of Proteins of Immunological Interest
This file contains information about the kabat directory in the NCBI ftp
repository, as well as its subdirectories:
At the bottom of this README is an e-mail address for problems and
Please please please:
If you are confused by the file formats or do not understand something,
please download a sample of each file and take a look at it, print it,
or whatever. They are really easier to understand after looking at them
than by just reading a description of them.
All in the Family...
We have had a couple of questions about our 'family' designation for the
dump files and Postscript files. These families do not and are not meant
to correspond to families cited in the literature. They are purely
something generated at our end for the purpose of combining sequences of
high AMINO ACID homology within the variable region together to help us
and others locate sequences that seem to belong with each other more than
with other sequences.
Each family is composed of sequences that differ from one another by 12
amino acids or fewer. These amino acid differences do not take into
consideration the codons that generated them; the division is based on
amino acid sequence only. (A glance over the codon sequences, though,
indicates that the codons are quite similar too.)
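To make the 12-residue rule concrete, here is a minimal sketch of how two
aligned amino acid sequences might be compared. It is not the procedure used
to build the families; the function names are hypothetical, and it assumes
that a gap (---) opposite a residue counts as one difference, which this
README does not specify.

```python
def count_differences(seq_a, seq_b):
    """Count the aligned positions at which two residue lists differ.

    Assumption: both sequences are already aligned to the same length,
    and a gap ('---') opposite a residue counts as one difference.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def within_family_distance(seq_a, seq_b, max_diff=12):
    """Return True if the two sequences differ by max_diff residues or fewer."""
    return count_differences(seq_a, seq_b) <= max_diff


# Two short illustrative fragments (not real database entries).
a = "GLU|VAL|GLN|LEU|GLN".split("|")
b = "GLU|VAL|LYS|LEU|GLN".split("|")
print(count_differences(a, b))
print(within_family_distance(a, b))
```

Only the amino acid residues are compared here; the codons play no part,
which matches the description above.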
Each family table or file has a miscellaneous table associated with it.
For example, HUMAN HEAVY CHAINS FAMILY I has a file associated with it
called HUMAN HEAVY CHAINS FAMILY I MISCELLANEOUS. This table contains
sequences that are not complete through the V-region but that do share a
great similarity with sequences in HUMAN HEAVY CHAINS FAMILY I. Because
they are incomplete, these sequences cannot be unambiguously assigned to
family I. Also, there are two large tables associated with each CLASS of
sequences, a miscellaneous unknown table and a miscellaneous fragment
table. The unknown table contains sequences which do not fit into a
family and are mostly complete. The fragment table contains sequences
which are incomplete throughout most of the V-region.
Please keep in mind that the family designation is for our purposes and
for purposes of locating similar sequences. There is no relationship
between our families and everyone else's families that we know of.
The directory /dump contains dump files generated from the Kabat Database
of Sequences of Proteins of Immunological Interest.
These files will be regenerated weekly to reflect the new additions and
corrections to the database.
These files will be present while the database is being converted into
ASN.1 format, as an intermediate between the fifth edition of Sequences
of Proteins of Immunological Interest and the ASN.1 files.
Since the close-of-data for printing the Fifth Edition (April 1991), the
database has grown enormously. As of October 1992, the number of amino
acids has increased by 50% for immunoglobulin, while the number of
codons has increased by 100% for immunoglobulin. The other categories
of sequences have grown at this rate or higher. Because of this, the
BBN-generated files for the Fifth Edition are severely out of date.
The dumpfiles are unfortunately not in GenBank flatfile format, because of
the massive amount of time required to convert the Kabat database table
format to GenBank flatfile format. To give interested workers the most
current information possible, we have decided to present the raw data
in dump format. Please read below for a description of the dump
File Naming Convention
The dump filenames describe the contents of the dumpfiles.
Here is an example:
The first field shows the species (various species)
The second field shows the sequence type (constant region)
The third field shows the type of constant region (heavy chain)
Mouse immunoglobulin heavy chains
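A dump filename can be taken apart into its fields with a few lines of code.
This is only a sketch: it assumes every name has exactly three dot-separated
fields, as in mouse.ig.hc, and the names DumpName and split_dump_name are
made up for illustration.

```python
from typing import NamedTuple


class DumpName(NamedTuple):
    """The three fields of a dump filename (hypothetical field names)."""
    species: str
    sequence_type: str
    subtype: str


def split_dump_name(filename):
    """Split a dump filename such as 'mouse.ig.hc' into its three fields.

    Assumption: the fields are separated by '.' and there are exactly three.
    """
    parts = filename.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated fields: %r" % filename)
    return DumpName(*parts)


name = split_dump_name("mouse.ig.hc")
print(name.species, name.sequence_type, name.subtype)
```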
Dump File Format
Here is an example of one of the entries you would find in the mouse.ig.hc
AA TABLE : MOUSE HEAVY CHAIN FAMILY I
NUC TABLE : CODONS OF MOUSE HEAVY CHAINS FAMILY I
AMINO NAME: TF5-139'CL
CODON NAME: TF5-139
REFERENCE : RILEY,S.C.,CONNORS,S.J.,KLINMAN,N.R. & OGATA,R.T. (1986) PROC.NAT.ACAD.SCI.USA,83,2589-2593. (CHECKED BY AUTHOR 08/19/87)
SPECIES : MOUSE
CLASS : IGA-KAPPA
SOURCE : NEONATAL SPLEEN CELL HYBRIDOMA
NOTES AA : FROM BALB/c NEONATAL SPLEEN CELL HYBRIDOMA.
NOTES NUC :
KABAT NUM : 0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|35A|35B|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|52A|52B|52C|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|82A|82B|82C|83|84|85|86|87|88|89|90|91|92|93|94|95|96|97|98|99|100|100A|100B|100C|100D|100E|100F|100G|100H|100I|100J|100K|101|102|103|104|105|106|107|108|109|110|111|112|113
AA SEQUEN : ---|GLU|VAL|GLN|LEU|GLN|GLU|SER|GLY|PRO|SER|LEU|VAL|LYS|PRO|SER|GLN|THR|LEU|SER|LEU|THR|CYS|SER|VAL|THR|GLY|ASP|SER|ILE|THR|SER|GLY|TYR|TRP|ASN|---|---|TRP|ILE|ARG|LYS|PHE|PRO|GLY|ASN|LYS|LEU|GLU|TYR|MET|GLY|TYR|ILE|SER|---|---|---|TYR|SER|GLY|SER|THR|TYR|TYR|ASN|PRO|SER|LEU|LYS|SER|ARG|ILE|SER|ILE|THR|ARG|ASP|THR|SER|LYS|ASN|GLN|TYR|TYR|LEU|GLN|LEU|ASN|SER|VAL|THR|THR|GLU|ASP|THR|ALA|THR|TYR|TYR|CYS|ALA|ARG|TRP|ASP|VAL|---|---|---|---|---|---|---|---|---|---|---|TRP|TYR|PHE|ASP|VAL|TRP|GLY|AL
NUC SEQ : ---|gag|gtg|cag|ctt|cag|gag|tca|gga|cct|agc|ctc|gtg|aaa|cct|tct|cag|act|ctg|tcc|ctc|acc|tgt|tct|gtc|act|ggc|gac|tcc|atc|acc|agt|ggt|tac|tgg|aac|---|---|tgg|atc|cgg|aaa|ttc|cca|ggg|aat|aaa|ctt|gag|tac|atg|ggg|tac|ata|agc|---|---|---|tac|agt|ggt|agc|act|tac|tac|aat|cca|tct|ctc|aaa|agt|cga|atc|tcc|atc|act|cga|gac|aca|tcc|aag|aac|cag|tac|tac|ctg|cag|ttg|aat|tct|gtg|act|act|gag|gac|aca|gcc|aca|tat|tac|tgt|gca|aga|tgg|gac|gtc|---|---|---|---|---|---|---|---|---|---|---|tgg|tac|ttc|gat|gtc|tgg|ggc|gc
Some things to note:
1. The lines are variable length, ended by a line feed (\n).
2. Some fields are empty; they only have a line feed.
3. Each field name is the same length (13 characters).
4. The sequences are aligned.
5. Each codon or amino acid is separated by a |. This is useful if you
have a procedure that can read in text and make an array out of it
using a delimiter like the | to indicate each index value.
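The points above can be sketched as a small reader for one entry. This is an
illustration under stated assumptions, not a supported parser: it splits each
line at the first colon rather than at the fixed field width, and the function
names parse_entry and split_cells are hypothetical. The sample text is a
shortened copy of the entry shown above.

```python
def parse_entry(text):
    """Parse one dump entry into a dict mapping field name to value.

    Assumption: each line is 'FIELD NAME : value', split at the first colon.
    Empty fields are kept with an empty string as their value.
    """
    entry = {}
    for line in text.split("\n"):
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines or lines without a field name
        entry[name.strip()] = value.strip()
    return entry


def split_cells(value):
    """Turn a |-separated sequence line into a list, one item per position."""
    return value.split("|") if value else []


sample = "\n".join([
    "AMINO NAME: TF5-139'CL",
    "NOTES NUC :",
    "KABAT NUM : 0|1|2|3",
    "AA SEQUEN : ---|GLU|VAL|GLN",
    "NUC SEQ : ---|gag|gtg|cag",
])

entry = parse_entry(sample)
numbers = split_cells(entry["KABAT NUM"])
residues = split_cells(entry["AA SEQUEN"])
codons = split_cells(entry["NUC SEQ"])

# Because the sequences are aligned, the same index in each list refers
# to the same Kabat position.
for number, residue, codon in zip(numbers, residues, codons):
    print(number, residue, codon)
```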
AA TABLE-- A simple description of where the sequence came from in
our database. Mouse heavy chains family I means that
this sequence is a mouse immunoglobulin heavy chain
which belongs to family I. A family, by our definition,
is a collection of sequences which differ from one another
by twelve or fewer amino acid residues.
NUC TABLE-- Nucleotide sequence table name (see AA TABLE)
AMINO NAME- The amino acid sequence name
CODON NAME- The nucleotide sequence name
REFERENCE-- The reference of the paper(s) that these sequences came
NOTES -- Annotations
INSERTS NUC- For alignment, sometimes codons and amino acids must be
removed from the sequence. When this is done, a #
sign is placed in the sequence where the removal occurred.
The sequence that was removed is placed in these rows.
KABAT NUM-- Kabat's numbering system.
The format of the dump is loose. Some of the different types of
sequences have different annotation fields. All entries have
AA TABLE, NUC TABLE, AMINO NAME, CODON NAME, NOTES AA, NOTES NUC,
KABAT NUM, NUC SEQ, and AA SEQUEN.
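Since the format is loose, a reader may want to check that an entry carries
the fields listed above before using it. The sketch below assumes an entry
has already been read into a dictionary keyed by field name; the names
REQUIRED_FIELDS and missing_fields are hypothetical.

```python
# The fields that, according to this README, every entry has.
REQUIRED_FIELDS = (
    "AA TABLE", "NUC TABLE", "AMINO NAME", "CODON NAME",
    "NOTES AA", "NOTES NUC", "KABAT NUM", "NUC SEQ", "AA SEQUEN",
)


def missing_fields(entry):
    """Return the required field names that are absent from an entry dict.

    A field that is present but empty is not reported as missing, because
    some fields in the dump contain only a line feed.
    """
    return [name for name in REQUIRED_FIELDS if name not in entry]


partial = {"AA TABLE": "MOUSE HEAVY CHAIN FAMILY I", "NOTES NUC": ""}
print(missing_fields(partial))
```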
The Rel5.0 directory contains the 5th edition of Sequences of Proteins of
Immunological Interest. These files reflect the database as of April
1991 and were generated by BBN. The nucleotide data is in GenBank
flatfile format. The amino acid data is in PRF format. There are
separate alignment files. Please see the documents README and kabat.doc
for more information.
This format is unfortunately no longer supported by the Kabat d
More information about the Immuno