Kabat Ig Database Survey
hperry at prophet.bbn.com
Tue Jan 7 13:09:37 EST 1992
1992 Kabat Ig Sequence Database Survey
We are pleased to announce that the 5th edition of "Sequences of
Proteins of Immunological Interest" is now available. It comprises three
volumes totaling more than 2500 pages. Contents include over 400 groupings of related
amino acid and nucleotide sequences of:
- signal sequences of light chains, heavy chains, T-lymphocyte
receptors, related proteins
- variable region of light chains, heavy chains, T-lymphocyte receptors
- constant region
- major histocompatibility antigens Class I
- I region gene products Class II
- related proteins
- D and J minigenes
"Sequences of Proteins of Immunological Interest" (NTIS PB91-192898) is
distributed by the National Technical Information Service: (703) 487-4650.
If you're a past computer user of the Kabat Database of "Sequences of
Proteins of Immunological Interest" or might be interested in becoming a new
one, then we'd like to hear from you. We are preparing distribution files of
the new 5th edition of our book and would appreciate your comments. Your
responses will help us make more informed design decisions that ultimately will
result in better database files for your use. Please share this survey with
your colleagues who might not have email access. A regular mail return address
is given at the end.
1992 Kabat Database Survey
I. Database Organization:
The last database distribution corresponded to the 1987 fourth edition
of our book. There were 6 files:
    tape.doc   (documentation; 717 records)
    tape.sum   (number of entries in each data file, broken down by group;
               649 records)
    aa.seq     ("raw" amino acid entries; 42804 records; 3.5 mb)
    nuc.seq    ("raw" nucleotide entries; 24932 records; 2 mb)
    aa.grp     (groups of aligned amino acid sequences; 12362 records; 1 mb)
    nuc.grp    (groups of aligned nucleotide sequences; 13438 records; 1.1 mb)
(a "record" is just a line in a file; "mb" means megabyte, or a million
bytes)
The entries in the raw files are individual sequences with no alignment
information, just raw sequence data plus associated information. The entries
in the group files are groups of related and aligned sequences. Entries in the
nucleotide files correspond to their associated amino acid sequences, where
available. The new release will be three times larger. Given the descriptions
of the database files, please answer the following with respect to the new
release:
1. Would you use the raw files?
2. Would you use the alignment group files?
3. Should variability statistics be included with amino acid groups of aligned
entries? (For each aligned position, variability statistics include:
number of residues, number of different amino acids, occurrences of most
common amino acid plus the "variability" which is computed as the number
of different amino acids divided by the frequency of the most common amino
acid.)
4. Would a different database organization be better for your use? For
example, would it be preferable to have a separate data file for each
section in the table of contents in our book:
- Signal sequences of light chains
- Signal sequences of heavy chains
- Signal sequences of T-lymphocyte receptor
- Signal sequences of related proteins
- Variable region light chain sequences
- Variable region heavy chain sequences
- Variable region T-lymphocyte receptor for antigen
- Constant region sequences
- Major histocompatibility antigens class I sequences
- I region gene products class II sequences
- Sequences of related proteins
- D minigenes
- J minigenes
5. In addition to the files described above, we plan to include the following:
- list of entry titles
- list of new entries
- list of reference numbers
- accession number index
- author name index
- journal citation index
- antibody specificity index
Would you use these? Is there anything missing?
6. The new 5th edition was produced from PostScript files. If you have access
to a PostScript printer for 8.5 x 14 legal size pages (which?) or a
PostScript previewer (which?), and are familiar with our book pages, would
you use these additional files? The 5th edition may be the last printed
edition. Would you be interested in an alternative future release of our
"book" that's in this format?
7. Other organization-related comments:
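For readers unfamiliar with the variability statistics mentioned in question 3,
here is a minimal sketch of how they could be computed for one group of aligned
amino acid sequences. It is an illustration only, not the code used to produce
the database. The function names (position_stats, alignment_stats) and the gap
characters are hypothetical, and "frequency" is assumed to mean the count of the
most common amino acid divided by the number of residues at that position.

```python
from collections import Counter


def position_stats(column, gap_chars="-."):
    """Statistics for one aligned position (hypothetical helper).

    `column` is an iterable of one-letter residue codes, one per sequence.
    Characters in `gap_chars` are assumed to mark gaps and are ignored.
    Returns None when the position holds no residues at all.
    """
    residues = [c for c in column if c not in gap_chars]
    if not residues:
        return None
    counts = Counter(residues)
    most_common, most_common_count = counts.most_common(1)[0]
    n_residues = len(residues)
    n_different = len(counts)
    # variability = n_different / frequency, where
    # frequency = most_common_count / n_residues (assumed definition).
    variability = n_different * n_residues / most_common_count
    return {
        "n_residues": n_residues,
        "n_different": n_different,
        "most_common": most_common,
        "most_common_count": most_common_count,
        "variability": variability,
    }


def alignment_stats(aligned):
    """Per-position statistics for equal-length aligned sequences."""
    return [position_stats(col) for col in zip(*aligned)]


if __name__ == "__main__":
    # Toy alignment, invented for illustration; not taken from the database.
    for i, s in enumerate(alignment_stats(["ACD", "ACE", "AGE", "A-E"]), 1):
        print(i, s)
```

Under these assumptions a fully conserved position has variability 1, and the
value grows as more different amino acids appear or as the most common one
becomes less dominant.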
II. Information Cross Referencing:
Cross referencing related pieces of information facilitates access to
the information represented in the database.
8. Would you use cross referencing between amino acid entries and their
associated nucleotide sequence entries? If so, should they each have the
same entry name?
9. Would you use cross referencing between a raw entry and its associated
group of aligned sequences?
10. Would you use cross referencing between a raw entry and associated entries
in the PIR database?
11. Would you use cross referencing between a raw entry and associated entries
in the GenBank database?
12. Other cross referencing comments:
III. File Entry Formats:
13. We plan to format the raw amino acid sequences in the CODATA format used
for the ASCII flat file version of PIR and the raw nucleotide entries in
GenBank format. Are your software tools able to handle these formats? If
not, what formats are you able to use?
14. If you are familiar with the old file formats used in our previous computer
release, would you prefer that the new release be available in these
formats? (The new database is available in these formats. If you're
interested in copies please let us know.)
15. Other formatting comments:
16. Miscellaneous comments:
We will post a message when the new version is available. If you're interested
in serving as a beta test site please let us know.
Thanks again. It's good to hear from you!
Harold Perry (hperry at bbn.com)
Carl Foeller (cfoeller at bbn.com)
BBN Systems and Technologies
10 Moulton St.
Cambridge, MA 02138