Kabat Database README

George Johnson george at immuno.esam.nwu.edu
Fri Nov 19 12:04:37 EST 1993


The Kabat database of Sequences of Proteins of Immunological Interest

Introduction
==========================================================================


This file describes the kabat directory in the NCBI ftp repository and 
its subdirectories:

	/Rel5.0
	/ps
	/dump
	/otherdata

At the bottom of this README is an e-mail address for problems and 
suggestions.  

Please please please:  

If you are confused by the file formats or do not understand something, 
please download a sample of each file and take a look at it, print it, 
or whatever.  They are really easier to understand after looking at them
than by just reading a description of them.

Thank You,

George Johnson


Family business
==========================================================================

All in the Family...

We have had a couple of questions about our 'family' designation for the 
dump files and Postscript files.  These families do not and are not meant 
to correspond to families cited in the literature.  They are purely 
something generated at our end for the purpose of combining sequences of 
high AMINO ACID homology within the variable region together to help us 
and others locate sequences that seem to belong with each other more than 
with other sequences.

Each family is composed of sequences that differ from one another by 12 
amino acids or fewer.  These amino acid differences do not take into 
consideration the codons that generated them; the division is based on
amino acid sequence only.  (A glance over the codon sequences indicates, 
though, that the codons are quite similar too.)
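
As a rough illustration of the rule above, the following Python sketch 
counts the positions at which two aligned amino acid sequences differ and 
tests the count against the 12-residue limit.  The function names, the 
sample fragments, and the treatment of gap positions as ordinary symbols 
are assumptions made for this example; they are not taken from the Kabat 
software.

```python
# Hypothetical helpers for illustration only; not the project's own code.
MAX_DIFFERENCES = 12  # limit stated in the text above


def count_differences(seq_a, seq_b):
    """Count aligned positions where two residue lists differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: a gap ("---") is compared like any other symbol.
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def same_family(seq_a, seq_b, limit=MAX_DIFFERENCES):
    """Return True if the sequences differ at no more than `limit` positions."""
    return count_differences(seq_a, seq_b) <= limit


if __name__ == "__main__":
    # Made-up short fragments, written in the dump file's residue style.
    first = "GLU|VAL|GLN|LEU|GLN|GLU".split("|")
    second = "GLU|VAL|LYS|LEU|GLN|GLN".split("|")
    print(count_differences(first, second))
    print(same_family(first, second))
```

The comparison is by amino acid only, as described above; codons play no 
part in it.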

Each family table or file has a miscellaneous table associated with it.
For example, HUMAN HEAVY CHAINS FAMILY I has a file associated with it 
called HUMAN HEAVY CHAINS FAMILY I MISCELLANEOUS.  This table contains 
sequences that are not complete through the V-region but that do share a 
great similarity with sequences in HUMAN HEAVY CHAINS FAMILY I.  Because 
they are incomplete, these sequences cannot be unambiguously assigned to 
family I.  Also, there are two large tables associated with each CLASS of
sequences: a miscellaneous unknown table and a miscellaneous fragment 
table.  The unknown table contains sequences which do not fit into a 
family and are mostly complete.  The fragment table contains sequences
which are incomplete throughout most of the V-region.  

Please keep in mind that the family designation is for our purposes and 
for purposes of locating similar sequences.  There is no relationship 
between our families and everyone else's families that we know of.


Dump files
==========================================================================

The directory /dump contains dump files generated from the Kabat Database 
of Sequences of Proteins of Immunological Interest.

These files will be regenerated weekly to reflect the new additions and 
corrections to the database.

These files will be present while the database is being converted into 
ASN.1 format, as an intermediate between the fifth edition of Sequences 
of Proteins of Immunological Interest and the ASN.1 files.

Since the close-of-data for printing the Fifth Edition (April 1991), the
database has grown enormously.  As of October 1992, the number of amino 
acids has increased by 50% for immunoglobulin, while the number of 
codons has increased by 100% for immunoglobulin.  The other categories
of sequences have grown at this rate or faster.  Because of this, the 
BBN-generated files for the Fifth Edition are severely out of date.

The dump files are not in GenBank flatfile format, unfortunately, because 
of the massive amount of time required to convert the Kabat database table
format to GenBank flatfile format.  To give interested workers the most 
current information possible, we have decided to present the raw data 
in dump format.  Please read below for a description of the dump 
format.


File Naming Convention
==========================================================================

The dump filenames describe the contents of the dumpfiles.
Here is an example:

various.con.hc

The first field shows the species (various species).
The second field shows the sequence type (constant region).
The third field shows the type of constant region (heavy chain).

Another example:

mouse.ig.hc

Mouse immunoglobulin heavy chains
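
The naming convention can be sketched in Python as follows.  This is a 
minimal example that assumes every dump filename has exactly three 
dot-separated fields, as in the two examples above; the function name and 
the dictionary keys are invented for illustration.

```python
def parse_dump_filename(filename):
    """Split a dump filename such as 'mouse.ig.hc' into its three fields."""
    parts = filename.split(".")
    if len(parts) != 3:
        # Assumption: names with more or fewer fields are not handled here.
        raise ValueError(f"expected three dot-separated fields: {filename!r}")
    species, sequence_type, subtype = parts
    return {
        "species": species,              # e.g. "mouse" or "various"
        "sequence_type": sequence_type,  # e.g. "ig" or "con"
        "subtype": subtype,              # e.g. "hc"
    }


if __name__ == "__main__":
    print(parse_dump_filename("various.con.hc"))
    print(parse_dump_filename("mouse.ig.hc"))
```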


Dump File Format
==========================================================================

Here is an example of one of the entries you would find in the mouse.ig.hc
dump file:

AA TABLE  :  MOUSE HEAVY CHAIN FAMILY I
NUC TABLE :  CODONS OF MOUSE HEAVY CHAINS FAMILY I
AMINO NAME:  TF5-139'CL
CODON NAME:  TF5-139
REFERENCE :  RILEY,S.C.,CONNORS,S.J.,KLINMAN,N.R. & OGATA,R.T. (1986) PROC.NAT.ACAD.SCI.USA,83,2589-2593.  (CHECKED BY AUTHOR 08/19/87)
AB SPECIFI:  
SPECIES   :  MOUSE
CLASS     :  IGA-KAPPA
STRAIN    :  
SOURCE    :  NEONATAL SPLEEN CELL HYBRIDOMA
INSERTSAA :  
INSERTSNUC:  
NOTES AA  :  FROM BALB/c NEONATAL SPLEEN CELL HYBRIDOMA.
NOTES NUC :  
KABAT NUM :  0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|35A|35B|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|52A|52B|52C|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|82A|82B|82C|83|84|85|86|87|88|89|90|91|92|93|94|95|96|97|98|99|100|100A|100B|100C|100D|100E|100F|100G|100H|100I|100J|100K|101|102|103|104|105|106|107|108|109|110|111|112|113
AA SEQUEN :  ---|GLU|VAL|GLN|LEU|GLN|GLU|SER|GLY|PRO|SER|LEU|VAL|LYS|PRO|SER|GLN|THR|LEU|SER|LEU|THR|CYS|SER|VAL|THR|GLY|ASP|SER|ILE|THR|SER|GLY|TYR|TRP|ASN|---|---|TRP|ILE|ARG|LYS|PHE|PRO|GLY|ASN|LYS|LEU|GLU|TYR|MET|GLY|TYR|ILE|SER|---|---|---|TYR|SER|GLY|SER|THR|TYR|TYR|ASN|PRO|SER|LEU|LYS|SER|ARG|ILE|SER|ILE|THR|ARG|ASP|THR|SER|LYS|ASN|GLN|TYR|TYR|LEU|GLN|LEU|ASN|SER|VAL|THR|THR|GLU|ASP|THR|ALA|THR|TYR|TYR|CYS|ALA|ARG|TRP|ASP|VAL|---|---|---|---|---|---|---|---|---|---|---|TRP|TYR|PHE|ASP|VAL|TRP|GLY|ALA|GLY|THR|THR|VAL|THR|VAL|SER|SER
NUC SEQ   :  ---|gag|gtg|cag|ctt|cag|gag|tca|gga|cct|agc|ctc|gtg|aaa|cct|tct|cag|act|ctg|tcc|ctc|acc|tgt|tct|gtc|act|ggc|gac|tcc|atc|acc|agt|ggt|tac|tgg|aac|---|---|tgg|atc|cgg|aaa|ttc|cca|ggg|aat|aaa|ctt|gag|tac|atg|ggg|tac|ata|agc|---|---|---|tac|agt|ggt|agc|act|tac|tac|aat|cca|tct|ctc|aaa|agt|cga|atc|tcc|atc|act|cga|gac|aca|tcc|aag|aac|cag|tac|tac|ctg|cag|ttg|aat|tct|gtg|act|act|gag|gac|aca|gcc|aca|tat|tac|tgt|gca|aga|tgg|gac|gtc|---|---|---|---|---|---|---|---|---|---|---|tgg|tac|ttc|gat|gtc|tgg|ggc|gca|ggg|acc|acg|gtc|acc|gtc|tcc|tca
//

Some things to note:

1.  The lines are variable length, each ended by a line feed (\n).
2.  Some fields are empty; after the field name they have only a line 
    feed.
3.  Each field name is the same length (13 characters, counting the 
    colon and the padding spaces).
4.  The sequences are aligned.
5.  Each codon or amino acid is separated by a |.  This is useful if you
    have a procedure that can read in text and make an array out of it, 
    using a delimiter like the | to mark off each element.
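
The points above suggest a simple way to read an entry.  The Python 
sketch below cuts each line at the 13-character field name, splits the 
sequence rows on |, and pairs each Kabat position with its residue.  It 
assumes that every field sits on one physical line; the function names 
are invented for this example, and the sample lines are a shortened 
excerpt of the entry shown above.

```python
FIELD_WIDTH = 13  # width of a field label such as "AA SEQUEN :  "


def parse_entry(lines):
    """Turn the lines of one dump entry into a {field name: value} dict."""
    fields = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line == "//":
            continue  # skip blank lines and the end-of-entry marker
        name = line[:FIELD_WIDTH].rstrip(" :")  # drop padding and colon
        fields[name] = line[FIELD_WIDTH:]       # empty string if no value
    return fields


def residues_by_position(fields):
    """Pair each Kabat position with the amino acid aligned to it."""
    positions = fields["KABAT NUM"].split("|")
    residues = fields["AA SEQUEN"].split("|")
    if len(positions) != len(residues):
        raise ValueError("numbering and sequence rows differ in length")
    return dict(zip(positions, residues))


if __name__ == "__main__":
    sample = [
        "SPECIES   :  MOUSE\n",
        "STRAIN    :  \n",
        "KABAT NUM :  0|1|2|3\n",
        "AA SEQUEN :  ---|GLU|VAL|GLN\n",
        "NUC SEQ   :  ---|gag|gtg|cag\n",
        "//\n",
    ]
    entry = parse_entry(sample)
    print(entry["SPECIES"])
    print(residues_by_position(entry))
```

Because the rows are aligned, the same index can be used to look up the 
codon in NUC SEQ that corresponds to a residue in AA SEQUEN.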

Some descriptions

AA TABLE--  A simple description of where the sequence came from in
            our database.  Mouse heavy chains family I means that
            this sequence is a mouse immunoglobulin heavy chain
            which belongs to family I.  A family, by our definition,
            is a collection of sequences which differ from one another
            by less than twelve amino acid residues.

NUC TABLE-- Nucleotide sequence table name (see AA TABLE)

AMINO NAME- The amino acid sequence name
CODON NAME- The nucleotide sequence name
REFERENCE-- The reference of the paper(s) that these sequences came
            from.
SPECIES,
CLASS,
NOTES    -- Annotations
INSERTSAA,
INSERTSNUC-- For alignment, sometimes codons and amino acids must be
             removed from the sequence.  When this is done, a #
             sign is placed in the sequence where the removal occurred.
             The sequence that was removed is placed in these rows.
KABAT NUM-- Kabat's numbering system.


The format of the dump is loose.  Some of the different types of
sequences have different annotation fields.  All entries have
AA TABLE, NUC TABLE, AMINO NAME, CODON NAME, NOTES AA, NOTES NUC,
KABAT NUM, NUC SEQ, and AA SEQUEN.
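
Because the format is loose, a reader may want to check each entry for 
the fields listed above before relying on them.  The Python sketch below 
is one way to do that.  It assumes that a line containing only // ends 
each entry, as in the example entry; the function names are invented for 
illustration.

```python
FIELD_WIDTH = 13  # width of a field label such as "NUC SEQ   :  "

# The fields that, per the text above, every entry has.
REQUIRED_FIELDS = {
    "AA TABLE", "NUC TABLE", "AMINO NAME", "CODON NAME",
    "NOTES AA", "NOTES NUC", "KABAT NUM", "NUC SEQ", "AA SEQUEN",
}


def iter_entries(lines):
    """Yield each entry as a list of lines, treating '//' as the terminator."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "//":
            if entry:
                yield entry
            entry = []
        elif line:
            entry.append(line)
    if entry:  # tolerate a final entry with no terminator
        yield entry


def missing_fields(entry):
    """Return the required field names that do not appear in the entry."""
    present = {line[:FIELD_WIDTH].rstrip(" :") for line in entry}
    return REQUIRED_FIELDS - present


if __name__ == "__main__":
    sample = [
        "AA TABLE  :  MOUSE HEAVY CHAIN FAMILY I\n",
        "AMINO NAME:  TF5-139'CL\n",
        "//\n",
    ]
    for entry in iter_entries(sample):
        print(sorted(missing_fields(entry)))
```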


Rel5.0
==========================================================================

The Rel5.0 directory contains the 5th edition of Sequences of Proteins of
Immunological Interest.  These files reflect the database as of April 
1991 and were generated by BBN.  The nucleotide data is in GenBank 
flatfile format.  The amino acid data is in PRF format.  There are 
separate alignment files.  Please see the documents README and kabat.doc 
for more information.

This format is unfortunately no longer supported by the Kabat database 
project.


PostScript
==========================================================================

The directory /ps contains PostScript files that reflect all the sequence 
data in the Kabat database.  These files may be printed or viewed on 
equipment supporting the PostScript format.  The files, when printed,
resemble the printed version of the 5th edition of Sequences of Proteins 
of Immunological Interest.  These files, though, are up to date and will 
be kept up to date.  For those familiar with the printed version
of the book, a few alterations in the reference portion of each amino 
acid file will be seen. 

A couple of noticeable heading alterations:

CORRESPONDING CODON:  the name of the codon sequence the
                      translated amino acid sequence was
                      derived from.


CORRESPONDING MATURE PROTEIN:  the name of the amino acid
                               sequence a signal sequence
                               is connected to.  For signal
                               sequences only.


Some comments on the files


These files are set up for printing on LEGAL SIZED PAPER.  If printed on 
letter sized paper, the printout will be truncated.

Because we want to generate the PostScript files as fast as possible, most
of the analysis and statistics found in the fifth edition are not present
in the PostScript files here.

The immunoglobulin light chain, immunoglobulin heavy chain, and T-cell 
receptor for antigen sequences have been divided into families of greatest 
homology.  The mouse heavy chain subgroups, for example, are now spread 
over 55 smaller files.

The immunoglobulin, MHC, and TCR sequences have been sorted based on the 
protein found in column 1.  The differences found between all the other
sequences and the one in column 1 are highlighted in bold.

How the files are named (examples)

Each file's name is formed from the directory path that leads to it:

/ps/ig/mouse/hc/i.cod   :  mouse heavy chains family I codons

/ps/ig/mouse/hc/i.aa    :  mouse heavy chains family I amino acids

/ps/d/mouse/mouse       :  d-minigenes of mouse  (no .aa or .cod;
                           only one file)

Other Data
==========================================================================

The directory /otherdata contains other information prepared from the 
database, such as collections of sequences or parts of sequences used in 
some analysis we did.  Read the README to find out what is stored in this 
directory.


Problems and Suggestions
==========================================================================


If you have problems with these files or suggestions, please e-mail the
project at:

George Johnson       george at immuno.esam.nwu.edu
T.T. Wu              tt at immuno.esam.nwu.edu



Other Resources
==========================================================================

Dan Jacobson has put the dumpfiles on his gopher server at 
merlot.welch.jhu.edu.  You can gopher there and search the files with
keywords and boolean constructs.  Here is information about that from
his "about this gopher" at the site.

>With a gopher client, you can reach here through
>
> % gopher merlot.welch.jhu.edu
>
>If you have a gopher server, you can add a tunnel to this one
>with the following link:
>
> Name=Computational Biology (Johns Hopkins University)
> Host=merlot.welch.jhu.edu
> Port=70
> Type=1
> Path=/
>
>-------------------
>
>Gopher hole builder:
>
>Dan Jacobson
>
>danj at welchgate.welch.jhu.edu

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


