Kabat database info

George Johnson geojohn at casbah.acns.nwu.edu
Tue Aug 17 09:35:55 EST 1993



Here are the README files for the Kabat database.  Most of the database
is archived at NCBI's ncbi.nlm.nih.gov repository site.  FTP there 
anonymously and change directories to /repository/kabat.  You can
also gopher there and to other gophers which support searching through
some of the database formats.  Finally, a method of searching the
database is described.

--------------------------------------------------------------------------

The Kabat database of Sequences of Proteins of Immunological Interest

Introduction
==========================================================================


This file contains information about the kabat directory in the NCBI ftp 
repository, as well as its subdirectories:

	/Rel5.0
	/ps
	/dump
	/otherdata

At the bottom of this README is an e-mail address for problems and 
suggestions.  

Please please please:  

If you are confused by the file formats or do not understand something, 
please download a sample of each file and take a look at it, print it, 
or whatever.  They are really easier to understand after looking at them
than by just reading a description of them.

Thank You,

George Johnson


Family business
==========================================================================

All in the Family...

We have had a couple of questions about our 'family' designation for the 
dump files and Postscript files.  These families do not and are not meant 
to correspond to families cited in the literature.  They are purely 
something generated at our end for the purpose of combining sequences of 
high AMINO ACID homology within the variable region together to help us 
and others locate sequences that seem to belong with each other more than 
with other sequences.

Each family is composed of sequences that differ from one another by 12 
amino acids or fewer.  These amino acid differences do not take into 
consideration the codons that generated them.  It is a division based on
amino acid sequence only.  (A glance over the codon sequences though 
indicates that the codons are quite similar too).
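
As a rough illustration of the comparison described above, here is a
minimal Python sketch that counts the aligned positions at which two
amino acid sequences differ.  It is not the procedure used to build the
database.  The function names are hypothetical, the threshold is left as
a parameter, and treating a gap (---) as just another symbol is an
assumption.

```python
def count_differences(seq_a, seq_b):
    """Count aligned positions at which two |-separated sequences differ."""
    a = seq_a.split("|")
    b = seq_b.split("|")
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: a gap ("---") is compared like any other symbol.
    return sum(1 for x, y in zip(a, b) if x != y)


def within_family_distance(seq_a, seq_b, threshold=12):
    """True if the two sequences differ at no more than `threshold` positions."""
    return count_differences(seq_a, seq_b) <= threshold
```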

Each family table or file has a miscellaneous table associated with it.
For example, HUMAN HEAVY CHAINS FAMILY I has a file associated with it 
called HUMAN HEAVY CHAINS FAMILY I MISCELLANEOUS.  This table contains 
sequences that are not complete through the V-region but that do share a 
great similarity with sequences in HUMAN HEAVY CHAINS FAMILY I.  Because 
they are incomplete, these sequences cannot be unambiguously assigned to 
family I.  Also, there are two large tables associated with each CLASS of
sequences, a miscellaneous unknown table and a miscellaneous fragment 
table.  The unknown table contains sequences which do not fit into a 
family and are mostly complete.  The fragment table contains sequences
which are incomplete throughout most of the V-region.  

Please keep in mind that the family designation is for our purposes and 
for purposes of locating similar sequences.  There is no relationship 
between our families and everyone else's families that we know of.


Dump files
==========================================================================

The directory /dump contains dump files generated from the Kabat Database 
of Sequences of Proteins of Immunological Interest.

These files will be regenerated weekly to reflect the new additions and 
corrections to the database.

These files will be present while the database is being converted into 
ASN.1 format, as an intermediate between the fifth edition of Sequences 
of Proteins of Immunological Interest and the ASN.1 files.

Since the close-of-data for printing the Fifth Edition (April 1991), the
database has grown enormously.  As of October 1992, the number of amino 
acids has increased by 50% for immunoglobulin, while the number of 
codons has increased by 100% for immunoglobulin.  The other categories
of sequences have grown at this rate or higher.  Because of this, the 
BBN-generated files for the Fifth Edition are severely out of date.

Unfortunately, the dump files are not in GenBank flat-file format, because
of the massive amount of time required to convert the Kabat database table
format to GenBank flat-file format.  To give interested workers the most 
current information possible, we have decided to present the raw data 
in dump format.  Please read below for a description of the dump 
format.


File Naming Convention
==========================================================================

The dump filenames describe the contents of the dump files.
Here is an example:

various.con.hc

The first field shows the species (various species).
The second field shows the sequence type (constant region).
The third field shows the type of constant region (heavy chain).

Another example:

mouse.ig.hc

Mouse immunoglobulin heavy chains
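
A small Python sketch of how the three fields could be pulled out of a
dump filename.  The function name and the dictionary keys are made up for
illustration, and the sketch assumes every dump filename has exactly
three dot-separated fields, as in the two examples above.

```python
def parse_dump_filename(name):
    """Split a dump filename such as 'mouse.ig.hc' into its three fields."""
    parts = name.split(".")
    if len(parts) != 3:
        # Assumption: names with another shape are not dump files.
        raise ValueError("expected three dot-separated fields: %r" % name)
    species, sequence_type, subtype = parts
    return {"species": species, "sequence_type": sequence_type, "subtype": subtype}


fields = parse_dump_filename("various.con.hc")
print(fields["species"], fields["sequence_type"], fields["subtype"])
```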


Dump File Format
==========================================================================

Here is an example of one of the entries you would find in the mouse.ig.hc
dump file:

AA TABLE  :  MOUSE HEAVY CHAIN FAMILY I
NUC TABLE :  CODONS OF MOUSE HEAVY CHAINS FAMILY I
AMINO NAME:  TF5-139'CL
CODON NAME:  TF5-139
REFERENCE :  RILEY,S.C.,CONNORS,S.J.,KLINMAN,N.R. & OGATA,R.T. (1986) PROC.NAT.ACAD.SCI.USA,83,2589-2593.  (CHECKED BY AUTHOR 08/19/87)
AB SPECIFI:  
SPECIES   :  MOUSE
CLASS     :  IGA-KAPPA
STRAIN    :  
SOURCE    :  NEONATAL SPLEEN CELL HYBRIDOMA
INSERTSAA :  
INSERTSNUC:  
NOTES AA  :  FROM BALB/c NEONATAL SPLEEN CELL HYBRIDOMA.
NOTES NUC :  
KABAT NUM :  0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|35A|35B|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|52A|52B|52C|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|82A|82B|82C|83|84|85|86|87|88|89|90|91|92|93|94|95|96|97|98|99|100|100A|100B|100C|100D|100E|100F|100G|100H|100I|100J|100K|101|102|103|104|105|106|107|108|109|110|111|112|113
AA SEQUEN :  ---|GLU|VAL|GLN|LEU|GLN|GLU|SER|GLY|PRO|SER|LEU|VAL|LYS|PRO|SER|GLN|THR|LEU|SER|LEU|THR|CYS|SER|VAL|THR|GLY|ASP|SER|ILE|THR|SER|GLY|TYR|TRP|ASN|---|---|TRP|ILE|ARG|LYS|PHE|PRO|GLY|ASN|LYS|LEU|GLU|TYR|MET|GLY|TYR|ILE|SER|---|---|---|TYR|SER|GLY|SER|THR|TYR|TYR|ASN|PRO|SER|LEU|LYS|SER|ARG|ILE|SER|ILE|THR|ARG|ASP|THR|SER|LYS|ASN|GLN|TYR|TYR|LEU|GLN|LEU|ASN|SER|VAL|THR|THR|GLU|ASP|THR|ALA|THR|TYR|TYR|CYS|ALA|ARG|TRP|ASP|VAL|---|---|---|---|---|---|---|---|---|---|---|TRP|TYR|PHE|ASP|VAL|TRP|GLY|ALA|GLY|THR|THR|VAL|THR|VAL|SER|SER
NUC SEQ   :  ---|gag|gtg|cag|ctt|cag|gag|tca|gga|cct|agc|ctc|gtg|aaa|cct|tct|cag|act|ctg|tcc|ctc|acc|tgt|tct|gtc|act|ggc|gac|tcc|atc|acc|agt|ggt|tac|tgg|aac|---|---|tgg|atc|cgg|aaa|ttc|cca|ggg|aat|aaa|ctt|gag|tac|atg|ggg|tac|ata|agc|---|---|---|tac|agt|ggt|agc|act|tac|tac|aat|cca|tct|ctc|aaa|agt|cga|atc|tcc|atc|act|cga|gac|aca|tcc|aag|aac|cag|tac|tac|ctg|cag|ttg|aat|tct|gtg|act|act|gag|gac|aca|gcc|aca|tat|tac|tgt|gca|aga|tgg|gac|gtc|---|---|---|---|---|---|---|---|---|---|---|tgg|tac|ttc|gat|gtc|tgg|ggc|gca|ggg|acc|acg|gtc|acc|gtc|tcc|tca
//

Some things to note:

1.  The lines are variable length, ended by a line feed (\n).
2.  Some fields are empty; they only have a line feed.
3.  Each field name is the same length (13 characters, counting the
    colon and the two spaces after it).
4.  The sequences are aligned.
5.  Each codon or amino acid is separated by a |.  This is useful if you
    have a procedure that can read in text and split it into an array,
    using the | as the delimiter between elements.
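
The splitting described in note 5 might look like this in Python.  This
is only a sketch.  The helper name split_field is hypothetical, and the
fixed offsets (a 10-character name, the colon, then two spaces) are read
off the example entry above rather than taken from a formal
specification.

```python
def split_field(line):
    """Split a 'NAME      :  a|b|c' line into (name, list of items)."""
    name = line[:10].strip()           # field name, padded to 10 characters
    value = line[13:].rstrip("\n")     # skip the colon and the two spaces
    return name, value.split("|")


_, positions = split_field("KABAT NUM :  0|1|2|3\n")
_, residues = split_field("AA SEQUEN :  ---|GLU|VAL|GLN\n")

# Because the sequences are aligned, position i of one list lines up
# with position i of the other.
by_position = dict(zip(positions, residues))
print(by_position["1"])
```

Pairing the KABAT NUM list with a sequence list in this way lets you look
up a residue by its Kabat number, insertion positions such as 35A
included.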

Some descriptions

AA TABLE--  A simple description of where the sequence came from in
            our database.  Mouse heavy chains family I means that
            this sequence is a mouse immunoglobulin heavy chain
            which belongs to family I.  A family, by our definition,
            is a collection of sequences which differ from one another
            by less than twelve amino acid residues.

NUC TABLE-- Nucleotide sequence table name (see AA TABLE)

AMINO NAME- The amino acid sequence name
CODON NAME- The nucleotide sequence name
REFERENCE-- The reference of the paper(s) that these sequences came
            from.
SPECIES,
CLASS,
NOTES    -- Annotations
INSERTSAA,
INSERTSNUC-- For alignment, sometimes codons and amino acids must be
             removed from the sequence.  When this is done, a #
             sign is placed in the sequence where the removal occurred.
             The sequence that was removed is placed in these rows.
KABAT NUM-- Kabat's numbering system.


The format of the dump is loose.  Some of the different types of
sequences have different annotation fields.  All entries have
AA TABLE, NUC TABLE, AMINO NAME, CODON NAME, NOTES AA, NOTES NUC,
KABAT NUM, NUC SEQ, and AA SEQUEN.
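
Since the format is loose, a reader program probably should not assume a
fixed set of fields.  Here is a minimal Python sketch that collects
whatever fields an entry has into a dictionary and then checks for the
fields listed above.  The names parse_dump and missing_fields are
hypothetical.  The sketch also assumes that each entry ends with a line
containing only //, as in the example entry, and that the colon always
sits in column 11.

```python
COMMON_FIELDS = ["AA TABLE", "NUC TABLE", "AMINO NAME", "CODON NAME",
                 "NOTES AA", "NOTES NUC", "KABAT NUM", "NUC SEQ", "AA SEQUEN"]


def parse_dump(text):
    """Return a list of dicts, one per entry, mapping field name to value."""
    records, current = [], {}
    for line in text.split("\n"):
        if line.strip() == "//":
            # End of entry (assumption: "//" alone on a line).
            records.append(current)
            current = {}
        elif len(line) > 10 and line[10] == ":":
            current[line[:10].strip()] = line[13:].strip()
        # Any other line (for example a blank one) is skipped.
    return records  # an unterminated final entry is dropped


def missing_fields(record):
    """List the common fields that an entry does not have."""
    return [name for name in COMMON_FIELDS if name not in record]


sample = (
    "AA TABLE  :  MOUSE HEAVY CHAIN FAMILY I\n"
    "AMINO NAME:  TF5-139'CL\n"
    "STRAIN    :  \n"
    "//\n"
)
records = parse_dump(sample)
print(records[0]["AMINO NAME"], missing_fields(records[0]))
```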


Rel5.0
==========================================================================

The Rel5.

