Announcements of PIR Network Request Service

POSTMASTER at NBRF.Georgetown.Edu
Thu Oct 21 16:50:16 EST 1993


               Announcements of the Protein Information Resource
                            Network Request Service

Highlights
1. Summaries for PIR-International Release 38.00, NRL_3D Release 13.02, and
   ALN Release 4.00
2. Summary of Database Developments in Release 38.00
3. ATLAS CD-ROM offers Expanded Platform Support
4. Homology Domain Superfamilies and
   Standardization of Homology Domain Features
5. Technical Development Bulletin Details Format Changes for Release 39
6. PIR Network Request Server Command Summary


Announcements
1. Summaries for PIR-International Release 38.00, NRL_3D Release 13.02, and
   ALN Release 4.00

Release 38.00 of the PIR-International database, and Release 13.02 of the
NRL_3D database (corresponding to Brookhaven Protein Data Bank Release 63)
are now available through the PIR On-line system and the Network Request
Server.  The PIR1, PIR2, PIR3 and NRL_3D databases have been distributed
on tape and CD-ROM.

Database   Release Sequences  Residues    Description
PIR1       38.00   11,706     4,129,053   Classified and Annotated Entries
PIR2       38.00   31,952     8,892,588   Annotated Entries
PIR3       38.00   17,590     5,001,183   Unverified Entries
NRL_3D     13.02    1,686       294,964   Sequences in Brookhaven PDB
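
As a quick cross-check of the table above, the following Python sketch totals
the three PIR sections and derives the mean sequence length per database.  The
figures are copied from the table; the variable and function names are
illustrative only and are not part of any PIR software.

```python
# Release statistics copied from the table above:
# database -> (sequences, residues)
RELEASE_STATS = {
    "PIR1": (11_706, 4_129_053),
    "PIR2": (31_952, 8_892_588),
    "PIR3": (17_590, 5_001_183),
    "NRL_3D": (1_686, 294_964),
}


def pir_totals(stats):
    """Sum sequences and residues over the PIR1, PIR2 and PIR3 sections."""
    pir = [value for name, value in stats.items() if name.startswith("PIR")]
    return sum(s for s, _ in pir), sum(r for _, r in pir)


def mean_length(stats, name):
    """Average number of residues per sequence in one database."""
    sequences, residues = stats[name]
    return residues / sequences


if __name__ == "__main__":
    total_seqs, total_res = pir_totals(RELEASE_STATS)
    print(f"PIR1-3 total: {total_seqs:,} sequences, {total_res:,} residues")
    for db in RELEASE_STATS:
        print(f"{db:<7} mean length {mean_length(RELEASE_STATS, db):.1f}")
```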

Growth of the PIR databases is documented in the file DBGROWTH.LIS available
through the Network Request Server.  The following files are also available
through the Server:
  PADD.LIS      PIR1 entries added since Release 37.00
  PREV.LIS      PIR1 entries with revised sequences since Release 37.00
  SUPERFAM.LIS  superfamilies recorded in PIR1 and PIR2
  KEYWORDS.LIS  keywords employed in PIR1 and PIR2
  FEATURES.LIS  features cataloged in PIR1 and PIR2
  JOURNALS.LIS  recognized journal abbreviations
  ALNBASE.LIS   a description of the ALN database
  ALNTITLE.LIS  titles in the ALN database
  NRLTITLE.LIS  titles in the NRL_3D Database
To obtain these and other files from the PIR Network Request Server, follow the
instructions in the last section of these announcements.


2. Summary of Database Developments in Release 38.00

In the last year we intensified our effort to improve the appearance and
consistency of the PIR-International database, and we are gratified that
recent research papers have specifically mentioned employing the PIR features
annotations in sequence analysis.  Section 4 summarizes the revised definition
and implementation of homology domain superfamilies and the standardization of
homology domain features in the PIR-International database.  Some of the other
improvements to the database that resulted from this effort are listed here.

(1) All explicit disulfide bond information previously found in free-text
comments has been converted to appropriate feature records.

(2) Except for the special cases of "selenocysteine" and "N-formylmethionine",
standard 3-letter residue codes explicitly appear in these features:
  Active site, Binding site, Cleavage site, Cross-link, Inhibitory site, and
  Modified site.

(3) All features for covalent binding sites contain the explicit "(covalent)"
label.  These features include the bound moieties carbohydrate, phosphate,
sulfate, and many others.

(4) All binding sites for calcium, copper, iron, metal clusters, and heme have
been classified into covalent (for sulfido cysteines) and noncovalent (for 
other dative ligand or predominantly ionic bonds).  Furthermore, these sites
have been recombined where necessary so that there is a one-to-one
correspondence between binding sites and feature records.

(5) The comment "in mature form" appears for amino-terminal features that are
not at the first position and for carboxyl-terminal features that are not at
the last position of those entries that present a full, precursor sequence.

(6) The experimental status of most of the covalent features, including all
the unique interchain disulfide bonds, is explicitly provided in the feature
records.  There are four different status indicators:
 experimental, predicted, absent, and atypical.
These status indicators appear in most of these features:
 Active site, Binding site, Cleavage site, Cross-link, Disulfide bonds,
 Inhibitory site, and Modified site.
These status indicators may also appear in these features:
 Domain, Peptide, and Protein.
In the "Domain:" feature a status indicator is not used for homology domains
or for self-evident features or features with arbitrary designations.

The "(experimental)" status means that the feature has been experimentally
observed in the indicated way at the indicated location.

The "(predicted)" status means that the nature, or the location, or both,
of the feature has been predicted by some means and confirmatory experimental
evidence is apparently unavailable.

The "(absent)" status is used to indicate a feature, otherwise predicted by
some means, that has been experimentally determined not to occur at the
indicated position.  It is intended to be used in the very limited cases when
an investigation of the specific feature produced the experimental result.

The "(atypical)" status is used to indicate a feature that does not follow the
"normal" pattern, or that would otherwise be predicted not to occur, but that
has been experimentally determined to occur at the indicated location.  Again,
it is intended to be used in the very limited cases when an investigation of
the specific feature produced this result.
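
For software that reads these annotations, the four status indicators can be
picked out of a feature description with a short pattern match.  The Python
sketch below is a minimal illustration, not PIR software.  The layout of a
feature record is not shown in these announcements, so the sketch assumes only
that the status appears as a parenthesized lowercase word somewhere in the
description text; the function name is hypothetical.

```python
import re

# The four status indicators named in the text above.
STATUS_INDICATORS = ("experimental", "predicted", "absent", "atypical")

_STATUS_RE = re.compile(r"\((" + "|".join(STATUS_INDICATORS) + r")\)")


def find_status(description):
    """Return the first status indicator found in a feature description,
    or None when no indicator is present.

    Assumption: the indicator is written as a parenthesized lowercase
    word, e.g. "(predicted)".  Other parenthesized labels, such as
    "(covalent)", are ignored and the rest of the text is not interpreted.
    """
    match = _STATUS_RE.search(description)
    return match.group(1) if match else None
```

A description carrying no indicator, as is the rule for homology domains,
yields None.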

A description of the PIR features annotations is available in a file that can
be obtained from the PIR Network Request Server through the command
  SEND FEATURES.DOC 
This file is based on the instructions to PIR annotators and documents the
standardization of features records achieved through release 38.

The large and steadily increasing number of protein structure modifications
that require standardized annotation in the PIR-International Protein Sequence
Database has led us to construct a prototype database of modified amino acid
residues to assist in producing appropriate features annotations for covalent
binding sites, modified sites and cross-links.  It is designed as a
PIR-International text database and presently contains over 140 entries
describing features annotated in the Protein Sequence Database.  For each
modified residue it provides a systematic chemical name, frequently observed
alternate names, the Chemical Abstracts Service registry number of the free
residue, the residue atomic formula and weight, the original amino acids that
are modified, indicators for whether the modification is amino-terminal,
carboxyl-terminal or a peptide chain cross-link, and the appropriate
feature annotation in the sequence database.

It is anticipated that in its full, regular release this database will also
provide appropriate literature citations, keywords, and a means for predicting
atomic weights of modified peptides and fragments selected from the sequence
database.  This database can be provided upon request for research purposes to
those who can receive it by Internet transmission.  It can be accessed by XQS
as a text database after completing an installation procedure or in a limited
way through the TYPE command without performing an installation.  If you have a
direct Internet address and would like to receive this prototype database, send
a brief electronic mail note to POSTMASTER at NBRF.Georgetown.Edu.

Each PIR entry applies to only one species.  Each "Species" record now contains
only one species specification.  This change completes the project for
"decombining" PIR entries.  In the future additional entries may be decombined
as genome research demonstrates that particular nonidentical sequences are the
products of different genes rather than heterozygosity of the same gene.

All NRL_3D entries now carry titles that conform to NBRF naming rules and match
corresponding PIR entries.  The title field of NRL_3D entries contains all the
same elements as the corresponding PIR entries and in the same order.  However,
additional elements are present in the NRL_3D entries to distinguish entries
that may be chains or fragments with different crystallographic coordinates,
have different crystallization conditions, or have different chemical
modifications.  The NRL_3D title of an entry may no longer correspond to the
Brookhaven Protein Data Bank COMPND record from which it was originally
derived.  In particular, all Enzyme Commission numbers have been changed to
conform to the current rules, and all co-crystallized protein chains from
different sources are distinguished and correctly identified.


3. ATLAS CD-ROM offers Expanded Platform Support

The new release of the ATLAS of Protein and Genomic Sequences CD-ROM is now
available for distribution.

The ATLAS CD contains versions of the PIR-International Protein Sequence
Database (release 38.00) and the GenBank Sequence Data Bank (release 78.0).  In
conjunction with the MIPS PATCHX data set (assembled from a collection of other
public domain protein sequence databases and also included on the CD-ROM), the
Protein Sequence Database provides the most complete collection of protein
sequence data currently available in the public domain.  This release of the
PIR-International Protein Sequence Database is comprehensively cross-referenced
to the MedLine abstracts by the MedLine Unique Identifier (MUID) and contains
cross-references to the Genome Data Base (GDB) of the Welch Medical Library at
the Johns Hopkins University.

Also provided on the CD-ROM are: release 13.02 of NRL_3D Structure-Function
Database, release 4.0 of the PIR Alignment Database (March 1993), and the March
1993 release of the JIPID ECOLI (Escherichia coli) Nucleic Acid Sequence
Database.  The NRL_3D Database is a protein sequence database extracted from
the Brookhaven Protein Data Bank (PDB) coordinate data files; it provides an
interface between the Protein Sequence Database and the PDB and provides access
to the PDB data via computerized sequence searching and comparison methods. 
The ALN database provides a set of multiple sequence alignments of closely
related protein sequences from the PIR-International Protein Sequence Database. 
The ECOLI Nucleic Acid Sequence Database is a comprehensive, nonredundant,
fully merged (all recognized contigs are assembled into single sequence
segments), and annotated Escherichia coli genomic sequence database.  All
entries in this dataset are directly linked to the corresponding protein
sequence products in the PIR-International Protein Sequence Database.

Included on the ATLAS CD-ROM is the ATLAS Information Retrieval program that
provides direct and simultaneous retrieval from all of the databases on the
CD-ROM.  This program is also featured on the CD-ROM produced in cooperation
with the journal Protein Science and the Protein Society.  In this release of
the ATLAS CD-ROM, versions of the ATLAS program are provided for
  PC-DOS,
  VMS (VAX and Alpha AXP),
  DEC (RISC) ULTRIX,
  SunOS,
  SGI/IRIX, and
  Macintosh
operating systems.  Support will be added for Alpha AXP/OSF systems in the
near future.

The ATLAS program provides an effective alternative to the Entrez program of
the National Center for Biotechnology Information (NCBI).  The ATLAS program is
designed on the principle that the sequence database annotations (protein
names, superfamily names, organism names, gene names, keywords, feature
descriptions, authors' names, etc.) provide meaningful information that can be
used to query the database directly.  These data provide direct links between
the nucleic acid and protein sequence database entries and entries in other
specialized data sets.  The ATLAS program provides an environment where data
entries from various databases can be linked dynamically by simultaneous
retrieval on these biological and bibliographic descriptors.

The program presents a command interface modeled on the DEC Command Language
(DCL) of the VMS operating system.  The "command/modifier" interface recognizes
truncated versions of the commands and modifiers.  The ATLAS command language
is similar to that employed in the NBRF PSQ and NAQ programs.  Those familiar
with these systems will experience very little difficulty in adapting to this
new program.  A menu interface is provided for PC-DOS systems.  A complete and
comprehensive Installation and User's Guide is provided on the CD-ROM and the
ATLAS program itself contains an integrated help facility.

ATLAS allows simultaneous retrieval on any selected subset (or all) of the
databases on the CD-ROM.  The user may select any combination of fields to
query on.  For example, a single query command will allow retrieval on the
TITLE and KEYWORDS fields of the GenBank and PIR-International databases. 
Queries can be refined by Boolean combination of sequential database queries. 
Queries are evaluated by an efficient substring searching algorithm.  For
example, a search on the term "globin" will retrieve the complete set of
hemoglobin, leghemoglobin, alpha-globin, beta-globin, myoglobin, and various
other globin and globin-like sequences.  This logic alleviates difficulties
resulting from usage of varying or nonstandard biological terminology within
the different databases. 
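
The effect of substring matching described above can be illustrated with a
short Python sketch.  This is a naive linear scan written only to show the
matching behavior; it is not the ATLAS algorithm, whose internals are not
described here.  The titles in the list are made-up examples, not actual
database entries.

```python
def substring_search(term, titles):
    """Return the titles that contain `term` anywhere, ignoring case."""
    needle = term.lower()
    return [title for title in titles if needle in title.lower()]


# Hypothetical titles, for illustration only (not actual database entries).
TITLES = [
    "hemoglobin alpha chain",
    "leghemoglobin",
    "myoglobin",
    "beta-globin",
    "cytochrome c",
]

if __name__ == "__main__":
    for hit in substring_search("globin", TITLES):
        print(hit)
```

Because the term may occur anywhere in the field, one query on "globin" finds
every title in the list except the cytochrome.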

The ATLAS CD-ROM also contains specially configured versions of the FASTA 
program that allow the protein sequence databases on the CD-ROM to be searched
(by sequence) directly.  These programs will execute on PC-DOS, VAX/VMS, and
DEC ULTRIX systems.

Orders for the ATLAS CD-ROM are accepted, WITHOUT PREPAYMENT on institutional 
purchase orders, by FAX or E-mail.  For further information in the US and the
Americas, please contact:

                Kathryn Sidman, Technical Services Coordinator
                      Protein Information Resource (PIR)
                National Biomedical Research Foundation (NBRF)
                           3900 Reservoir Rd., NW
                              Washington DC 20007
                             FAX: (202) 687-1662
                            phone: (202) 687-2121
                     E-mail: PIRMAIL at nbrf.georgetown.edu
                             PIRMAIL at gunbrf.bitnet

In Europe contact:

              Martinsried Institute for Protein Sequences (MIPS)
                    Max-Planck-Institute for Biochemistry
                          8033 Martinsried, Germany
                             FAX:  49 89 8578 2655
                            phone: 49 89 8578 2657
                   E-mail: mewes at ehpmic.mips.biochem.mpg.de

In Asia and Oceania contact:

           Japan International Protein Information Database (JIPID)
                         Science University of Tokyo
                        2669 Yamazaki, Noda 278 Japan
                             FAX:  81 47 122 1544 
                            phone: 81 48 124 1501
                       E-mail: Tsugita at JPNSUT31.BITNET


4. Homology Domain Superfamilies and
   Standardization of Homology Domain Features

When the concept of a protein superfamily was introduced by Margaret O. Dayhoff
in the mid-1970's, the nearly 500 completely sequenced proteins then known were
each assigned to one of 116 superfamilies, thus partitioning the database into
unrelated nonoverlapping groups.  The subsequent recognition of multidomain
proteins whose component domains have separate evolutionary origins has made
the original approach inappropriate.  Moreover, the term "superfamily" has been
used in the literature with several different meanings, as has the term
"domain." 

We have developed a formal model for the superfamily concept that allows
sequence homology-based partitioning of the Protein Sequence Database into
domain superfamilies that are closed under transitivity.  In this model, a
sequence can contain overlapping homology domains; sequence domains are defined
as subsequences and are distinct when they correspond to different
subsequences, even when they overlap or when one is contained within another.
The domain consisting of the complete sequence is called the "homeomorphic"
domain and the corresponding superfamily is a homeomorphic superfamily.  In
database entries, homeomorphic superfamily classification will be indicated by
name and/or placement number and domain superfamily classification will be
indicated by a name that includes the word "homology," with the corresponding
homology domains optionally identified as features. 

Using this model, we have begun to convert current placement groups in the
database into homeomorphic superfamilies, to develop mechanisms to assign
well-characterized wild-type sequences into these superfamilies, and to
identify members of domain superfamilies.  In Release 38.0, over 23,000 entries
are assigned to 3072 homeomorphic superfamilies and 142 domain superfamilies.
Placement numbers, previously available for only PIR1, are now being assigned
to classified entries in PIR2.  In many cases, placement numbers for PIR2
entries only give the first (superfamily), or the first and second (family) of
the 5 numbers used for detailed placement in PIR1; thus, the placement number
assigned to an entry may not be unique unless all 5 numbers are specified.

                Standardization of homology domain features

The "Domain:" feature identifier has been used to distinguish several types of
sequence regions: functional (transit peptide, cofactor binding domain),
topological (extracellular, transmembrane, intracellular), structural
(endonexin fold, coiled coil), and regions of sequence homology both within the
same sequence and with other sequences.  The annotation of topological regions
and signal and transit peptides has been standardized.  Currently we are
standardizing the annotation of homology domains in line with the refined
definition of homology domain superfamilies.  For our purposes, a homology
domain is a region of sufficient size to have a characteristic three-
dimensional structure.  It also must have been found in "different" proteins,
that is, proteins that also have sequence region(s) that are not related. 
Inasmuch as "homology" means we are assuming a common evolutionary ancestry of
a given type of domain, we do not currently include regions of restricted
composition that may have arisen independently.  All proteins that contain a
certain type of homology domain can be characterized as members of a domain
superfamily, where it is understood that "common evolutionary ancestry" applies
to the common homology domains and not necessarily to any other regions of the
sequences.  For convenience, we give the same name to the homology domain and
to the corresponding domain superfamily.

Homology domains are named by some descriptive phrase ending with "homology" or
"repeat homology".  Usually the descriptive phrase will include the name of a
protein that contains the domain (calmodulin repeat homology); sometimes a
class name is used (lipocalin homology); and some domains have unique names
that are in common use (kringle homology).  There are several characteristics
of homology domains that should be kept in mind:

(1) By definition they are identified by sequence similarity; thus they are
intrinsically always "predicted" and never "experimental".  For this reason,
no status is explicitly given (unless the domain is "atypical").

(2) The boundaries given are somewhat arbitrary; others may include more or
fewer residues when they define the corresponding domain.  We will endeavor to
make the boundaries of a given type of domain consistent from entry to entry
rather than using the varying boundaries given by different authors.  Usually we
will try to locate a boundary at or with reference to a well-conserved
sequence feature.

(3) The presence of a homology domain does NOT necessarily imply that the
region actually forms the typical structure or performs the typical function. 
For example, not all calmodulin repeat homology domains can adopt the E-F hand
conformation or bind calcium.

(4) A domain will be labelled "(fragment)" if the sequencing is incomplete.

(5) A domain will be labelled "(atypical)" if only a partial copy of the domain
is present.  This status can also be used when there is a sizable insertion or
deletion in the domain or when it is missing so many of the defining
characteristics that it is difficult to identify or to align with the more
typical examples of the domain.

(6) Each homology domain must be given as a separate feature, the location of
which is usually specified by a single pair of residue numbers separated by a
hyphen.  The form "2-25,60-100" will be used in conjunction with the
"(atypical)" status to indicate the regions that align with the typical domain
of that type when the sequence contains a sizable insertion.

(7) The presence of a homology domain feature requires that the corresponding
homology domain superfamily name appear in the Superfamily record but the
converse is not true.  The superfamily classification is a property of the
"conceptual complete sequence".  All members of a PIR placement group (called a
"homeomorphic superfamily") will carry the same superfamily description even
when the sequence shown is fragmentary and does not contain all of the homology
domains.  The standardization of the Superfamily records within homeomorphic
superfamilies is expected to be complete in Release 39. The creation of feature
records for homology domains is an ongoing project.
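
The location forms described in item (6), a single pair such as "2-25" or a
comma-separated list such as "2-25,60-100", can be read by a few lines of
code.  The Python sketch below is a minimal illustration; the function names
are hypothetical, and it assumes that residue numbers are plain positive
integers and that both ends of a range are inclusive.

```python
def parse_location(location):
    """Parse a location such as "2-25" or "2-25,60-100" into a list of
    (start, end) residue-number pairs.

    Raises ValueError if a segment is not of the form "start-end" or if
    start is greater than end.
    """
    segments = []
    for part in location.split(","):
        start_text, sep, end_text = part.strip().partition("-")
        if not sep:
            raise ValueError(f"expected 'start-end', got {part!r}")
        start, end = int(start_text), int(end_text)
        if start > end:
            raise ValueError(f"start exceeds end in {part!r}")
        segments.append((start, end))
    return segments


def aligned_length(location):
    """Total number of residues covered by all segments (inclusive ends)."""
    return sum(end - start + 1 for start, end in parse_location(location))
```

For the example in item (6), parse_location("2-25,60-100") gives the two
segments (2, 25) and (60, 100), which together cover 65 residues.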


5. Technical Development Bulletin Details Format Changes for Release 39

The fourth PIR-International Technical Development Bulletin is available in the
file PIRTECH.LIS that can be sent by the PIR Network Request Server or picked
up by anonymous FTP from the UH Gene-Server, ftp.bchs.uh.edu, IP address
129.7.2.43.  This electronic bulletin provides detailed specifications of the
database format and serves as an "early warning system" for software developers
and others who are concerned about changes in the format and standards for the
PIR databases.  The fourth Bulletin documents the changes to be introduced with
the Enhanced NBRF Format in Release 39.00.  If you are interested in the
technical aspects of these database changes and would like to be placed on the
mailing list for the Technical Bulletin, send a brief electronic mail note to
POSTMAST at GUNBRF on BITNET or to POSTMASTER at NBRF.Georgetown.Edu on Internet.


6. PIR Network Request Server Command Summary

The National Biomedical Research Foundation Protein Information Resource
Network Request Server is a full-function fileserver and database query system.
Operating since August 1990, it is capable of handling database queries,
sequence searches and sequence submissions, in addition to fileserver requests.
To use this server, request commands should be sent to
  FILESERV at GUNBRF on BITNET or
  FILESERV at NBRF.Georgetown.EDU on Internet.
The server recognizes the following commands sent either in a mail message
or (if the sender is on BITNET) in a command message or a file:

  Command        Action
  -------        -----------------------------------------------
  ACCESSION      list entry codes and titles by accession number
  AND            combine QUERY commands with Boolean AND
  AUTHOR         list entry codes and titles by author
  BASES          list accessible databases
  CROSS          list PIR entry codes and titles corresponding to
                   a particular nucleic sequence database entry
  DEPOSIT        deposit entry for database submission
    END DEPOSIT  terminate deposit entry
  FEATURE        list entry codes and titles by feature table entry
  GENE           list entry codes and titles for a gene name
  GET            return entry by entry code
  HELP           return HELP instructions
  INDEX          list SENDable files
  JOURNAL        list entry codes and titles by journal citation
  KEYWORD        list entry codes and titles by keyword
  MEMBER         list alignments containing entry code as a member
  NOT            combine QUERY commands with Boolean NOT
  OR             combine QUERY commands with Boolean OR
  QUERY          begin collecting QUERY commands
    END QUERY    terminate collecting commands and execute QUERY
  QUIT           ignore the remaining text (E-mail signature blocks)
  RETURN         change return address for gateway mail
  SEARCH         search for matching sequences by FASTA procedure
    END SEARCH   terminate sequence for searching
  SEND           send file
  SPECIES        list entry codes and titles by species
  SUGGEST        leave suggestion or correction for PIR staff
    END SUGGEST  terminate suggestion text
  SUPERFAMILY    list entry codes and titles by superfamily name
  TAXONOMY       report taxonomy for scientific or common name
  TITLE          list entry codes and titles by title
  USE            set databases, dates, lengths or formats to use in
                   limited searches

Multiple commands can be sent with one command on each line of a mail message
or file.  Commands should NOT be sent on the Subject line of a mail message.
Receipt of BITNET command messages and files will be acknowledged immediately.
Mail messages will be acknowledged by return mail.
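
For those who prepare requests by program, the rule of one command per line in
the message body can be sketched as follows.  This Python fragment is only an
illustration: the function name is hypothetical, and it produces the body text
alone, leaving the addressing and sending of the message to your own mail
software.

```python
def build_request_body(commands):
    """Join server commands into a mail message body, one command per line.

    Commands belong in the body, never on the Subject line, so this
    function only produces body text.
    """
    lines = []
    for command in commands:
        command = command.strip()
        if not command:
            continue  # skip empty entries
        if "\n" in command:
            raise ValueError("each command must fit on a single line")
        lines.append(command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two commands taken from these announcements, followed by QUIT so
    # that any signature block appended by the mail program is ignored.
    print(build_request_body(["SEND FEATURES.DOC", "HELP SEARCH", "QUIT"]), end="")
```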

For help in using any of the commands, send a request of the form
  HELP topic
for example
  HELP SEARCH

In addition to the commands, help instructions are also available on the
following topics:
  Custom_Services
  Databases
  FTP
  Gateway_Access
  Help_en_Espanol
  Help_en_francais
  Hints
  IBM-VM_BITNET
  On-Line_Access
  PIR_Distribution
  VAX-VMS_BITNET

Because of network gateway communication protocols, there are limitations on
requests sent through gateways.  Users not on BITNET or INTERNET who access the
server through local or network gateways should read and carefully follow these
instructions before sending requests.  Only mail message requests (not command
messages or files) can be sent through gateways.  Because addresses posted on
gateway mail do not always work for the return, before you send requests
through network gateways it is strongly recommended that you first contact
John S. Garavelli (POSTMAST at GUNBRF on BITNET, POSTMASTER at NBRF.Georgetown.EDU on
Internet).  We will confirm a return address for you and may instruct you to
use the RETURN command to ensure that your request output will reach you.  It
is not usually necessary to do this if you are on BITNET or INTERNET, unless
your system employs a local remailer or your mail program applies a
nonstandard return address (for example a personal name on the FROM: line).

The BITNET network and the network gateways impose strict limits on file size.
Poorly posed database queries may result in output so extensive that it could
not be returned by network mail.  Therefore, an output limit of 1000 lines for
each command and 3000 lines for each request is imposed by the PIR server.
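
The combined effect of the two limits can be sketched in Python as follows.
This is an illustration of the arithmetic only, not the server's code.  It
assumes that output beyond a limit is simply dropped; how the server actually
treats or reports the excess is not described here.

```python
MAX_LINES_PER_COMMAND = 1000   # limit stated in the text
MAX_LINES_PER_REQUEST = 3000   # limit stated in the text


def apply_output_limits(outputs):
    """Truncate per-command outputs to the stated limits.

    `outputs` is a list of lists of lines, one inner list per command.
    Assumption: lines beyond a limit are dropped without notice.
    """
    limited = []
    remaining = MAX_LINES_PER_REQUEST
    for lines in outputs:
        allowed = min(len(lines), MAX_LINES_PER_COMMAND, remaining)
        limited.append(lines[:allowed])
        remaining -= allowed
    return limited
```

Under these assumptions, a request of four commands that each produce 1500
lines would return 1000 lines for each of the first three commands and nothing
for the fourth.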

The DEPOSIT and QUERY commands, and the SEARCH and SUGGEST commands (in their
multiline form) must be followed by their respective END commands after the
text appearing on the intervening lines.  The DEPOSIT command requires, and the
SEARCH command optionally uses, parameters that appear on the same line as the
command.  Because these four commands are so complex, users should obtain and
carefully read the help instructions before attempting to use them.
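
A request can be checked for a missing END command before it is mailed.  The
Python sketch below is a minimal illustration under stated assumptions: the
function name is hypothetical, only DEPOSIT and QUERY are checked, and the
lines between a command and its END are not interpreted.  SEARCH and SUGGEST
are left unchecked because their single-line form needs no END and cannot be
told apart from the multiline form without the help instructions.

```python
# Commands that, per the text, must always be closed by an END command.
BLOCK_COMMANDS = ("DEPOSIT", "QUERY")


def unclosed_block(lines):
    """Return the name of a DEPOSIT or QUERY command left without its
    matching END command, or None if every such block is closed."""
    open_block = None
    for line in lines:
        words = line.split()
        if not words:
            continue  # blank lines carry no command
        first = words[0].upper()
        if open_block is None:
            if first in BLOCK_COMMANDS:
                open_block = first
        elif first == "END" and len(words) > 1 and words[1].upper() == open_block:
            open_block = None
    return open_block
```

A result of None means that every DEPOSIT or QUERY block in the request is
terminated; it does not mean that the text inside the block is valid.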

The databases available through the PIR Network Server and their abbreviations
for code specification are as follows:
  Abbreviation  Database                              Update Schedule
  PIR1          PIR Classified and Annotated Entries  weekly
  PIR2          PIR Annotated Entries                 weekly
  PIR3          PIR Unverified Entries                weekly
  ALN           PIR Alignment Entries                 quarterly
  NRL_3D        Brookhaven Data Bank Sequences        quarterly
  PATCHX        MIPS PIR-Supplementary Database       quarterly
  N             NBRF Nucleic
  GB*           GenBank (TM)                          as received
  GBNEW         GenBank (TM) New Entries              weekly
  EMBL*         EMBL                                  as received

In the FASTA output of the SEARCH command the abbreviation for PATCHX is
shortened to PATX and NRL_3D is shortened to NR3D; the longer abbreviations
should be used to retrieve entries with the GET command.  Not all commands
work with all databases; please read the information returned by the command
HELP DATABASES.
The GenBank (TM), GB, and EMBL databases are divided into sections
corresponding to the sections of their standard releases.  These databases
may be individually accessed with the USE BASES command with the database
abbreviation and the section abbreviation, for example
  USE BASES GBPRI
or all sections of a given database may be accessed with the database
abbreviation and an asterisk, for example
  USE BASES PIR*
or
  USE BASES GB*
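
The abbreviation rules above can be captured in a short Python sketch for
those who process server output by program.  The function names are
hypothetical; only the two shortened abbreviations and the USE BASES forms
given in the text are assumed.

```python
# FASTA output abbreviations and the longer names to use with GET,
# as given in the text.
FASTA_TO_GET = {"PATX": "PATCHX", "NR3D": "NRL_3D"}


def get_database_name(fasta_abbreviation):
    """Map an abbreviation seen in SEARCH (FASTA) output to the database
    abbreviation expected by the GET command; other names pass through."""
    name = fasta_abbreviation.upper()
    return FASTA_TO_GET.get(name, name)


def use_bases_command(database, section=None):
    """Build a USE BASES line for one section, or for all sections ("*")
    when no section is given."""
    suffix = section.upper() if section else "*"
    return f"USE BASES {database.upper()}{suffix}"
```

For example, use_bases_command("GB", "PRI") returns the line USE BASES GBPRI,
and use_bases_command("PIR") returns USE BASES PIR*.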
------------------------------------------------------------------------
                                 Dr. John S. Garavelli
                                 Database Coordinator
                                 Protein Information Resource
                                 National Biomedical Research Foundation
                                 Washington, DC  20007
                                 POSTMAST at GUNBRF.BITNET
                                 POSTMASTER at NBRF.Georgetown.Edu


