Release 40.00 of PIR-International Protein Sequence Database

Mon May 2 12:47:17 EST 1994

                Announcements of the Protein Information Resource
                                   2 May 1994

1. PIR-International Protein Sequence Database Release 40.00
2. Summary of Database Developments in Release 40.00
3. The March 1994 ATLAS of Protein and Genomic Sequences CD-ROM
4. The Complex Carbohydrate Structure Database and CarbBank

1. PIR-International Protein Sequence Database Release 40.00

Release 40.00 of the PIR-International database, Release 14.00 of the NRL_3D
Database (corresponding to Brookhaven Protein Data Bank Release 65), and 
Release 5.1 of the PIR ALN Database of Protein Sequence Alignments are now
available through the PIR On-line System and the PIR Network Request Server.
The PIR1, PIR2, PIR3, and NRL_3D databases are distributed on tape, and those
databases plus the ALN Database are distributed on CD-ROM.

Database   Release   Sequences    Residues   Contents
PIR1       40.00        12,227   4,454,283   Classified and Annotated Entries
PIR2       40.00        34,147   9,362,019   Annotated Entries
PIR3       40.00        21,049   5,930,994   Unverified Entries
NRL_3D     14.00         2,722     484,598   Protein Sequences in Brookhaven PDB
ALN         5.1          1,133 entries       Protein Sequence Alignments
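
As a quick check on the scale of this release, the counts in the table can be totaled for the three Protein Sequence Database sections. The short Python sketch below uses only the PIR1, PIR2, and PIR3 figures listed above; the variable names are illustrative and not part of any PIR software.

```python
# Counts copied from the release table (PIR1, PIR2, and PIR3 only).
# Each value is (number of sequences, number of residues).
releases = {
    "PIR1": (12_227, 4_454_283),
    "PIR2": (34_147, 9_362_019),
    "PIR3": (21_049, 5_930_994),
}

total_sequences = sum(seqs for seqs, _ in releases.values())
total_residues = sum(res for _, res in releases.values())
mean_length = total_residues / total_sequences

print(f"sequences: {total_sequences:,}")
print(f"residues:  {total_residues:,}")
print(f"mean residues per sequence: {mean_length:.1f}")
```

The three sections together hold 67,423 sequences and 19,747,296 residues, or roughly 293 residues per sequence on average.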

The NRL_3D Database contains protein sequences extracted from the Brookhaven
Protein Data Bank (PDB) coordinate data files. Introduced by the PIR in 1990
as an interface between the Protein Sequence Database and the PDB, it was the
first sequence database providing access to the PDB data via computerized
sequence searching and comparison methods. The ALN Database contains multiple
sequence alignments of selected protein sequences from the PIR-International
Protein Sequence Database.

Growth of the PIR databases is documented in the file DBGROWTH.LIS available
through the PIR Network Request Server. The following files are also available
through the Server:
  PADD.LIS      PIR1 and PIR2 entries added since Release 39.00
  PREV.LIS      PIR1 and PIR2 entries with revised sequences since Release 39.00
  SPECIES.LIS   species recorded in PIR1 and PIR2
  SUPERFAM.LIS  superfamilies recorded in PIR1 and PIR2
  KEYWORDS.LIS  keywords employed in PIR1 and PIR2
  FEATURES.LIS  features catalogued in PIR1 and PIR2
  JOURNALS.LIS  recognized journal abbreviations
  ALNBASE.LIS   a description of the ALN Database
  ALNTITLE.LIS  titles in the ALN Database
  NRLTITLE.LIS  titles in the NRL_3D Database

To obtain these and other files from the PIR Network Request Server, requests
should be sent to:
  FILESERV at NBRF.Georgetown.Edu

2. Summary of Database Developments in Release 40.00

The enhanced NBRF format was introduced with Release 39.00. These format
enhancements were undertaken in order to
(1) improve the coverage, accuracy, and completeness of the PIR-International
    Protein Sequence Database, 
(2) provide additional data fields and define them more precisely so that
    conversions to other formats or database systems (RDBMS or OODBMS) can
    be accomplished more easily,
(3) make the overall presentation more uniform for human readers and easier
    for programs to parse, which facilitates automatic checking of format,
    syntax, and vocabulary within the database, and
(4) make the two flat file distribution formats of the PIR-International,
    the NBRF format and the CODATA format, more completely interconvertible
    without any degradation of information.
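
To illustrate what "computer parsable" means in practice, here is a minimal Python sketch that reads entries in the basic NBRF sequence layout as it is commonly described: a line beginning with ">P1;" followed by the entry code, a title line, and sequence lines terminated by "*". This layout is an assumption made for the example. The enhanced NBRF format carries additional annotation fields whose specification is in PIRTECH.LIS, and this sketch does not attempt to cover them. The function name, the Entry type, and the sample records are all hypothetical.

```python
from typing import Iterator, NamedTuple


class Entry(NamedTuple):
    code: str       # entry identifier following ">P1;"
    title: str      # free-text title line
    sequence: str   # residues with whitespace and the trailing "*" removed


def parse_nbrf(text: str) -> Iterator[Entry]:
    """Yield entries from text in the assumed basic NBRF layout.

    Lines after the "*" terminator and before the next ">" line
    (for example, annotation lines) are skipped, not interpreted.
    """
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue
        # ">P1;CODE" -> keep the part after the first semicolon.
        _, _, code = line[1:].partition(";")
        title = next(lines, "").strip()
        residues = []
        for seq_line in lines:
            chunk, star, _ = seq_line.partition("*")
            residues.append("".join(chunk.split()))
            if star:  # "*" marks the end of the sequence
                break
        yield Entry(code.strip(), title, "".join(residues))


# Made-up records, for illustration only.
sample = """>P1;EXAMPLE1
hypothetical protein - example organism
MKTAYIAKQR QISFVKSHFS
RQLEERLGLI*
(annotation lines would follow here)
>P1;EXAMPLE2
second hypothetical protein
MSE*
"""

entries = list(parse_nbrf(sample))
for entry in entries:
    print(entry.code, len(entry.sequence))
```

A parser of this kind is also a natural place to hang the automatic format and vocabulary checks mentioned in item (3), although the checks themselves depend on the published specification.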

Because we realized that planned changes could cause software problems if our
users were not given advance notice, we set up a developers mailing list and
began issuing the PIR Technical Development Bulletin. The fourth Bulletin
documented the changes that would be introduced with the enhanced NBRF format
in Release 39.00. It is available in the file PIRTECH.LIS, which can be sent by
the PIR Network Request Server or picked up by anonymous FTP from the UH
Gene-Server. This electronic bulletin
provides detailed specifications of the database format and serves as an "early
warning system" for software developers and others who are concerned about
changes in the format and standards for the PIR databases. If you are
interested in the technical aspects of these database changes and would like to
be placed on the mailing list for the Technical Bulletin, send a brief
electronic mail note to
  POSTMASTER at NBRF.Georgetown.Edu.

Descriptions of the CODATA Exchange Format and of PIR feature annotations can
be obtained from the PIR Network Request Server in the files CXFSD.LIS and
FEATDOC.LIS respectively.

3. The March 1994 ATLAS of Protein and Genomic Sequences CD-ROM

The new release of the ATLAS of Protein and Genomic Sequences CD-ROM is now
available for distribution.

The ATLAS Information Retrieval program provides direct and simultaneous
retrieval from the databases included on the CD-ROM or on mounted secondary
CD-ROMs. In this release of the ATLAS CD-ROM, versions of the ATLAS program are
provided for these operating systems:
  OpenVMS Alpha AXP,
  DEC OSF/1 Alpha AXP,
  SGI/IRIX, and

The ATLAS program provides a user-friendly environment where entries from
selected databases can be linked dynamically for simultaneous retrieval on
biological annotations and bibliographic information, such as protein names,
superfamily names, homology domains, organism names, gene names, keywords,
feature descriptions, authors' names, etc. The ATLAS program also enables
selected sets of sequences to be searched directly for either exact
subsequences or patterns. A comprehensive Installation and User's Guide is
provided on the CD-ROM and the ATLAS program itself contains an integrated help

The ATLAS CD-ROM contains specially configured versions of the FASTA programs
that allow the protein sequence databases on the CD-ROM to be searched by
sequence directly. These programs will execute on PC-DOS, VAX/VMS, and DEC
ULTRIX systems.

The ATLAS CD-ROM includes:
  - PIR1, PIR2, PIR3, NRL_3D, and ALN data sets
  - release 39.06 of the MIPS PATCHX data set
  - release 2.1 of the JIPID ECOLI (Escherichia coli) Nucleic Acid Sequence
    Database
  - release 81.0 of the NCBI-GenBank Genetic Sequence Databank GBNEW data set
  - indexes for release 81.0 of the NCBI-GenBank Genetic Sequence Databank
  - release 8 of the Complex Carbohydrate Structure Database

The MIPS PATCHX data set has been assembled from a collection of other public
domain protein sequence databases. When used in conjunction with the MIPS
PATCHX data set, the Protein Sequence Database provides the most complete
collection of protein sequence data currently available in the public domain.

The ECOLI Nucleic Acid Sequence Database compiled by scientists at JIPID and
NBRF is a comprehensive, nonredundant, fully merged (all recognized contigs are
assembled into single sequence segments), and annotated database containing
sequence information from the GenBank, EMBL, and NBRF nucleic acid sequence
databases, plus information entered directly from published reports. Protein
coding regions are annotated in the feature tables, as are additional features
such as promoter regions, Shine-Dalgarno sequences, and transcription
termination sequences. The protein coding regions are directly cross-referenced
to the PIR-International Protein Sequence Database and features are formatted
to allow direct translation by computer. Overlapping sequences are merged and
ordered by map position. When their orientation is known, sequence segments are
represented in the same direction (the plus strand). Genetic map positions are
directly correlated with the Kohara physical map using an algorithm developed
by Kunisawa and coworkers that compares restriction fragment lengths, directly
incorporating information on restriction site distances while avoiding site
inversion problems.
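
The idea of merging overlapping sequence reports into a single segment can be sketched in a few lines of Python. This is an illustration only, not the procedure used by JIPID and NBRF: it assumes both segments are already on the same strand and overlap exactly, and it ignores sequence conflicts, orientation, and map position. The function name and the min_overlap parameter are hypothetical.

```python
from typing import Optional


def merge_overlap(left: str, right: str, min_overlap: int = 4) -> Optional[str]:
    """Join two segments when a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, or None if no exact overlap of at least
    `min_overlap` bases exists. The longest possible overlap is tried first.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


# Made-up segments that share the six bases "GTTAGC".
a = "ATGGCGTACGTTAGC"
b = "GTTAGCCATGA"
merged = merge_overlap(a, b)
print(merged)
```

Here the two segments merge into ATGGCGTACGTTAGCCATGA. A production merge would also need to reconcile disagreements between reports, which is where the annotation and map information described above becomes essential.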

Because of its size, the GenBank Sequence Databank can no longer be included
in full on the ATLAS CD-ROM. All of the GBNEW data set is provided, and the
LOCUS and TITLE information is available for the 14 other data sets.
However, index files for the NCBI-GenBank Genetic Sequence Databank release
81.0 are provided so that for VAX/VMS and MS-DOS systems with multiple CD-ROM
drives the ATLAS program can access the NCBI-GenBank Sequence Databank mounted
on a secondary CD-ROM drive.

Through the cooperation of CarbBank, the Complex Carbohydrate Structure
Database (CCSD) and its associated CarbBank software are now included on the
ATLAS of Protein and Genomic Sequences CD-ROM. The ATLAS CD-ROM includes
documentation and an Installation Manual and Tutorial for CarbBank. The ATLAS
program cannot access the CCSD. The CCSD and CarbBank are discussed in more
detail in the next section.

Orders for the ATLAS CD-ROM are accepted, WITHOUT PREPAYMENT, on institutional 
purchase orders, by FAX or E-mail. For further information in the US and the
Americas, please contact:

                Kathryn Sidman, Technical Services Coordinator
                      Protein Information Resource (PIR)
                National Biomedical Research Foundation (NBRF)
                           3900 Reservoir Rd., NW
                              Washington DC 20007
                             FAX: (202) 687-1662
                            phone: (202) 687-2121
                     E-mail: PIRMAIL at
                             PIRMAIL at gunbrf.bitnet

In Europe contact:
              Martinsried Institute for Protein Sequences (MIPS)
                    Max-Planck-Institute for Biochemistry
                          8033 Martinsried, Germany
                             FAX:  49 89 8578 2655
                            phone: 49 89 8578 2657
                   E-mail: mewes at

In Asia and Oceania contact:
           Japan International Protein Information Database (JIPID)
                         Science University of Tokyo
                        2669 Yamazaki, Noda 278 Japan
                             FAX:  81 47 122 1544 
                            phone: 81 48 124 1501
                       E-mail: Tsugita at JPNSUT31.BITNET

4. The Complex Carbohydrate Structure Database and CarbBank

This release of the ATLAS CD-ROM includes the Complex Carbohydrate Structure
Database (CCSD) release 8 and CarbBank version 2.5. The CCSD is a database that
contains complex carbohydrate structures and associated text information
derived from scientific publications. The database has a flat file format.
Structural abbreviations and nomenclature are similar to those found in the
journal Carbohydrate Research. CarbBank is the computer management system for
CCSD database files. CarbBank runs on IBM-compatible microcomputers under
PC-DOS or MS-DOS and has a menu-driven user interface. It has an Editor for
creating or modifying database records and a Searcher that finds records
matching Search Criteria that the user supplies. A Report
generation facility allows the user to create a variety of reports on the
contents of databases, and an Interchange module allows CarbBank to view
reports and to exchange records among ASCII text files, a CarbBank-specific
version of the CCSD, and an ASN.1 version of the CCSD.

The CarbBank program cannot operate from a floppy diskette, from a CD-ROM, or
from a write-protected disk. There are other minimum system and hardware
requirements. Please consult the CarbBank documentation or contact CarbBank before
attempting to install this software on your PC.

For information about CarbBank contact:
         Dana Smith
         CarbBank/CCSD Manager
         114 W. Magnolia St.
         Suite 305
         Bellingham, WA 98225, USA
         Phone:     (206) 733-7183
         FAX:       (206) 733-7283
         EMail:     Internet: 76424.1122 at Compuserve.Com

For inquiries about how to obtain the PIR-International Protein Sequence
Database contact:

         Ms. Katie Sidman
         PIR Technical Services Coordinator
         National Biomedical Research Foundation
         3900 Reservoir Road NW
         Washington DC 20007
         Phone:      (202) 687-2121
         FAX:        (202) 687-1662
         EMail:      PIRMAIL at
                                Dr. Winona C. Barker, Director
                                Protein Information Resource
                                National Biomedical Research Foundation
                                Washington DC 20007
                                BARKER at

