Announcements of PIR Network Request Service

POSTMASTER at NBRF.Georgetown.Edu
Tue Nov 3 14:48:23 EST 1992


               Announcements of the Protein Information Resource
                            Network Request Service

Highlights
1. Hints for Retrieving Sequence Database Entries
2. PIR Network Request Service Command Summary


1. Hints for Retrieving Sequence Database Entries

The first thing to appreciate about the sequence databases is that the most
commonly sought information for every entry is contained in the title field.
The title field contains the protein name, the source organism and, for an
enzyme, the EC number.  The keyword field, on the other hand, contains
information that does not necessarily duplicate what is in the title.  It is
designed to provide ancillary retrieval information that is not conveyed in a
protein name, such as
  * disease or resistance states associated with the protein
    ACQUIRED IMMUNE DEFICIENCY SYNDROME or CYANATE RESISTANCE
  * metabolic roles or pathways
    CALCIUM TRANSPORT or PENTOSE PHOSPHATE PATHWAY
  * posttranslational modification processes
    HYDROXYLATION
  * tissues, cell types or subcellular components that are the origin of or
    the targets of the protein
    HEART, LEUKOCYTE or MITOCHONDRIAL MATRIX
  * structural characteristics
    TRIMER or ZINC FINGER
  * larger classification schemes the protein may fall in
    SERINE PROTEASE or STRUCTURAL PROTEIN
Most searches are for information contained in the title field.  The most
common reason a keyword search fails is that a protein name was used as the
search term; protein names are found in the title, not in the keyword list.  A
list of the keywords found in the current public distribution release of the
PIR can be obtained by using the command
   SEND KEYWORDS
The keywords used by the PIR correspond closely to the MESH terms of the
National Library of Medicine.  

When only one field is being searched, all the words that follow the field name
must be found in the same entry for there to be a "hit".  This means that all
the words on one command line form a logical AND; a QUERY that repeats the same
field connected by AND is unnecessary.  Furthermore, since the title combines
both the protein name and the source organism, the title and species can be
searched in a single command; for example,
   QUERY
   TITLE ALPHA
   AND
   TITLE HEMOGLOBIN
   AND
   SPECIES HUMAN
   END QUERY
can be simply combined as
   TITLE HUMAN ALPHA HEMOGLOBIN
OR operations, on the other hand, are equivalent to combining the results
of several separate searches; for example,
   QUERY
   TITLE HUMAN ALPHA HEMOGLOBIN
   OR
   TITLE HUMAN DELTA HEMOGLOBIN
   END QUERY
would achieve the same result as the two separate TITLE searches.
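
Viewed as operations on hit lists, AND corresponds to a set intersection and
OR to a set union.  The following Python sketch illustrates the idea only; it
is not the server's implementation, and the entry codes and hit lists are
invented for the example.

```python
# Hypothetical hit lists; the entry codes are invented for illustration.
alpha_hits = {"CODE1", "CODE2", "CODE3"}       # stands in for TITLE ALPHA
hemoglobin_hits = {"CODE2", "CODE3", "CODE4"}  # stands in for TITLE HEMOGLOBIN
human_hits = {"CODE2", "CODE5"}                # stands in for SPECIES HUMAN


def combine_and(*hit_lists):
    """AND keeps only the entries present in every hit list."""
    result = set(hit_lists[0])
    for hits in hit_lists[1:]:
        result &= set(hits)
    return result


def combine_or(*hit_lists):
    """OR merges the hit lists, as if the searches were run separately."""
    result = set()
    for hits in hit_lists:
        result |= set(hits)
    return result


and_result = combine_and(alpha_hits, hemoglobin_hits, human_hits)
or_result = combine_or(alpha_hits, human_hits)
```

Here and_result holds only the entry common to all three lists, while
or_result holds every entry that appears in either list.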

The Boolean operators must be placed on lines of their own, not on the same
line as another command; for example,
   TITLE CYTOCHROME AND P450
will fail because only entries with the character string "AND" in the title
along with "CYTOCHROME" and "P450" will hit.
   TITLE CYTOCHROME P450
means "search for titles containing both strings 'CYTOCHROME' and 'P450'
in either order".  Double quotation marks can be used to change the meaning
slightly:
   TITLE "CYTOCHROME P450"
means "search for titles containing the string 'CYTOCHROME P450'".  Double
quotation marks must be used when any part of the search string is fewer
than 3 characters long; for example,
   TITLE "CYTOCHROME C"
The Boolean NOT command can be used most effectively to remove entries
with names that are extensions of some shorter name of interest; for example,
   QUERY
   TITLE "CYTOCHROME C "
   NOT
   TITLE OXIDASE
   NOT
   TITLE REDUCTASE
   END QUERY
will eliminate nearly everything but cytochrome C from the resulting list.
(Because the indexing scheme used by the retrieval program lumps together all
the nonalphanumeric characters, the space appearing after the "C" and before
the double quotation mark eliminates entries like "cytochrome c2" but not
"cytochrome c'" from the list.)
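
The rules above (operators on their own lines, quotation marks around short
search strings) can be applied mechanically when a request is assembled by a
program.  The Python sketch below is one way to do so on the sender's side;
the function names field_command and query_block are hypothetical helpers,
not part of the PIR service.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def field_command(field, text, phrase=False):
    """Build one search line, e.g. 'TITLE CYTOCHROME P450'.

    The text is wrapped in double quotation marks when phrase=True or when
    any word in it is shorter than 3 characters.
    """
    words = text.split()
    if not words:
        raise ValueError("empty search string")
    if not phrase and any(w.upper() in BOOLEAN_OPERATORS for w in words):
        # An operator on a command line would be treated as a search string.
        raise ValueError("Boolean operators belong on their own lines")
    if phrase or any(len(w) < 3 for w in words):
        return f'{field} "{text}"'
    return f"{field} {' '.join(words)}"


def query_block(first, *rest):
    """Wrap commands in QUERY ... END QUERY, one operator per line.

    rest is a sequence of (operator, command) pairs.
    """
    lines = ["QUERY", first]
    for operator, command in rest:
        if operator not in BOOLEAN_OPERATORS:
            raise ValueError(f"unknown operator: {operator}")
        lines.extend([operator, command])
    lines.append("END QUERY")
    return lines


example = query_block(
    field_command("TITLE", "CYTOCHROME C "),
    ("NOT", field_command("TITLE", "OXIDASE")),
    ("NOT", field_command("TITLE", "REDUCTASE")),
)
```

Joining example with newlines reproduces the cytochrome C query shown above,
including the space kept inside the quotation marks.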

One very inappropriate type of request is the following:
   GENE CONCANAVALIN
   KEYWORD CONCANAVALIN
   FEATURE CONCANAVALIN
   TITLE CONCANAVALIN
   SEARCH CONCANAVALIN
Specifically, "concanavalin" is not a gene name, so it will not be found
in the gene field.  The word "concanavalin" is plain text, not a sequence,
so it should not appear after the SEARCH command --- only actual sequences
should appear after a SEARCH command.  While "concanavalin" might possibly
appear in the keyword or feature fields, its use there would be very
specialized and not indicative of a concanavalin entry.  The only command
that makes any sense is
   TITLE CONCANAVALIN
The biggest problem arises when the SEARCH command is used in that way.  The
futile FASTA search it generates wastes shared computer resources that others
could use much more fruitfully.  The FASTA program has been modified
to recognize some occurrences of plain text and print a warning.

The USE command restricts searches to particular databases or to entries
added or modified within a particular time period.  Such restrictions apply
to all subsequent search commands in the same request, not only to those
inside a QUERY.

After a successful search, the GET command should be used to retrieve the
actual text of an entry.  The format of the GET command is either
   GET database:code
 or simply
   GET code
There are no spaces around the colon and only one code may follow each GET
command.
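
Because each GET line carries exactly one code, retrieving several entries
means repeating the command.  A minimal Python sketch of that expansion
follows; get_commands is a hypothetical helper, the entry codes are invented,
and using NRL_3D as the database label in the GET line is an assumption.

```python
def get_commands(codes, database=None):
    """Return one GET line per entry code.

    With a database name the form is 'GET database:code' (no spaces around
    the colon); without one it is simply 'GET code'.
    """
    lines = []
    for code in codes:
        code = code.strip()
        if not code or any(ch.isspace() for ch in code):
            raise ValueError("each GET takes exactly one code")
        if database:
            lines.append(f"GET {database}:{code}")
        else:
            lines.append(f"GET {code}")
    return lines


plain = get_commands(["CODE1", "CODE2"])
qualified = get_commands(["CODE1"], database="NRL_3D")
```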

There are a few special considerations to keep in mind when using the NRL_3D
database of sequence information extracted from the Brookhaven Protein Data
Bank.  Only these fields in NRL_3D are indexed and can be searched through the
PIR Server:  TITLE, SPECIES, FEATURE and the sequence.  At this time the TITLE
field consists of the COMPND records from the Brookhaven Protein Data Bank file
as well as the species.  In most cases your search will be for something in
this TITLE or name field.  For example, after an initial
   USE BASES NRL_3D
the command
   TITLE MYOGLOBIN
will retrieve a list of all the myoglobin sequences in the PDB and
   SPECIES MOUSE
will retrieve a list of all the mouse sequences.  The SPECIES field is not 100%
accurate for the NRL_3D because of some eccentricities in the SOURCE records of
the PDB used to construct it.  Although there is a KEYWORD field in NRL_3D
entries, it is constructed directly from the PDB HEADER record and is not
indexed.

With release 10.00 of NRL_3D, the PIR will cease converting all of each PDB
release.  Instead, only new and modified entries will be converted; the NRL_3D
entries will gradually be modified to standardize spelling, capitalization,
nomenclature, taxonomy and keywords.  With this standardization the KEYWORD
field will become more meaningful and probably be indexed within the coming
year.


2. PIR Network Request Service Command Summary

The National Biomedical Research Foundation Protein Information Resource
network request service is a full-function fileserver and database query
system.  Operating since August 1990, it is capable of handling database
queries, sequence searches and sequence submissions, in addition to
fileserver requests.  To use this server, request commands should be sent to
FILESERV at GUNBRF on BITNET or FILESERV at NBRF.Georgetown.EDU on Internet.
The server recognizes the following commands sent either in a mail message,
or (if the sender is on BITNET) in a command message or a file:

  Command        Action
  -------        -----------------------------------------------
  ACCESSION      list entry codes and titles by accession number
  AND            combine QUERY commands with Boolean AND
  AUTHOR         list entry codes and titles by author
  BASES          list accessible databases
  CROSS          list PIR entry codes and titles corresponding to
                   a particular nucleic acid sequence database entry
  DEPOSIT        deposit entry for database submission
    END DEPOSIT  terminate deposit entry
  FEATURE        list entry codes and titles by feature table entry
  GENE           list entry codes and titles for a gene name
  GET            return entry by entry code
  HELP           return HELP instructions
  HOST           list entry codes and titles by host species
  INDEX          list SENDable files
  JOURNAL        list entry codes and titles by journal citation
  KEYWORD        list entry codes and titles by keyword
  MEMBER         list alignments containing entry code as a member
  NOT            combine QUERY commands with Boolean NOT
  OR             combine QUERY commands with Boolean OR
  QUERY          begin collecting QUERY commands
    END QUERY    terminate collecting commands and execute QUERY
  QUIT           ignore the remaining text (E-mail signature blocks)
  RETURN         change return address for gateway mail
  SEARCH         search for matching sequences by FASTA procedure
    END SEARCH   terminate sequence for searching
  SEND           send file
  SPECIES        list entry codes and titles by species
  SUGGEST        leave suggestion or correction for PIR staff
    END SUGGEST  terminate suggestion text
  SUPERFAMILY    list entry codes and titles by superfamily name
  TAXONOMY       report taxonomy for scientific or common name
  TITLE          list entry codes and titles by title
  USE            set databases, dates or formats to use in limited searches
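
As an illustration of sending commands in a mail message, the Python sketch
below assembles a request with one command per line.  It only builds the
message and does not send it.  The sender address is a placeholder, the
helper name build_request is hypothetical, and reading "at" in the server
address as "@" is an assumption.

```python
from email.message import EmailMessage

# Assumed form of the Internet address given in the text above.
SERVER_ADDRESS = "FILESERV@NBRF.Georgetown.EDU"


def build_request(sender, commands):
    """Put one server command on each line of the message body."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = SERVER_ADDRESS
    msg.set_content("\n".join(commands) + "\n")
    return msg


request = build_request(
    "user@example.org",  # placeholder sender address
    ["USE BASES NRL_3D", "TITLE MYOGLOBIN", "QUIT"],
)
```

Ending the command list with QUIT keeps the server from trying to interpret
an E-mail signature block appended by the sender's mail program.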

Multiple com

