How can I search for a motif?

Hassan BADRANE hbadrane at
Mon Oct 17 05:23:43 EST 1994


	There is a program called PROSEARCH which looks in the Prosite "database of
motifs" for motifs that occur in your sequences. I have used this program on a Unix
system. You may also access this database directly by WWW at the address:
Here follows the help file for this program, which I hope will clarify more:

Prosearch reads in a file containing one or more protein sequences and
searches for patterns, described sites, or structures in the Prosite
Database compiled by Amos Bairoch.  The output is sent to the named

The input protein sequence file can be in any reasonable format.

The output is a table of sites, followed by the relevant sections from
the Prosite Database. 

The Prosite Database is updated with every release of the SwissProt
database (about every three months).


	Over the past year or so Amos Bairoch (bairoch at earn.cgecmu51)
has released a number of versions of his Prosite database.  This is a
database of patterns which have been associated with particular
enzymatic activities or structures.  One example is the well-known
pattern for N-linked glycosylation, Asn-Xxx-Ser/Thr.

	Amos has compiled a database that consists of references about
each pattern, validity of the patterns, occurrences, and a host of other
details.  This database is of general use, and has been used by Amos in
his PC/Gene Suite of programs for analysis of DNA and Protein sequences. 

	I wanted to use this database on a Unix machine and be able to
ask the question, "Which of these patterns occur in sequence X?"

	This is the second release of Prosearch.  It completely
supersedes the first version with one important bug fix, and support for
VMS, MS-DOS, and UNIX.  Also, by using ReadSeq, a fine program from Don
Gilbert <gilbertd at>, more protein data formats are


	Most patterns can be expressed as regular expressions.  For
example the pattern '^P' when used with the unix utility grep matches
any line in the input that begins with a 'P'. 
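
The grep example above can be reproduced in a few lines of Python. This is only an illustrative sketch: the helper name lines_matching and the sample text are made up for the example and are not part of Prosearch.

```python
import re

def lines_matching(pattern, text):
    """Return the lines of text that match pattern, in the manner of grep."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]

# Hypothetical input: three lines, two of which begin with 'P'.
sample = "PROTEIN\nAPPLE\nPEPTIDE"
print(lines_matching("^P", sample))
```

The '^' anchors the match to the start of each line, so a 'P' elsewhere in the line (as in APPLE) does not count.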

	I translated all but 1 of the 337 patterns in Prosite to Unix
style regular expressions and wrote a simple searching program to search
a protein sequence for their occurrence.  The pattern I did not
translate was the pattern PS00003, which is Tyrosine Sulfation.  There is
no clean pattern for this modification. 
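
As a rough illustration of what such a translation involves, here is a minimal Python sketch that converts the simpler Prosite pattern elements to a regular expression. It is not the Awk code used by Prosearch; the function name prosite_to_regex is hypothetical, and only a subset of the Prosite syntax is assumed: single residues, x for any residue, [..] for allowed residues, {..} for excluded residues, repeat counts such as x(2) or x(2,4), and the < and > terminal anchors.

```python
import re

# One pattern element: optional '<', a core, an optional repeat count, optional '>'.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)

def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern, e.g. 'N-{P}-[ST]-{P}', to a regex."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: %r" % element)
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        if low and high:
            core += "{%s,%s}" % (low, high)   # x(2,4) -> .{2,4}
        elif low:
            core += "{%s}" % low              # x(2)   -> .{2}
        pieces.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(pieces)

print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(prosite_to_regex("[ST]-x(2)-[DE]"))
```

Elements outside this subset raise an error instead of being translated silently, which seems the safer choice for a pattern like Tyrosine Sulfation that has no clean regular-expression form.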

	The program is written in the Awk language, and runs on machines
which have either Nawk from AT&T, Gawk from the Free Software
Foundation, or one of several versions of Awk which run on MSDOS


	Input files are any protein sequence files in an unstructured
format.  AWK will accept the input on any number of lines of any length
(I've tried protein sequences up to 2500 amino acids on one line with
no problem).  Each ASCII character will be interpreted as an amino acid,
and all letters must be capitalized.  With 'readseq' any of a number of
formats can be used. 

GCG-format files are accepted as input sequences.
Sequences with no sort of header are NOT accepted. If you have a raw sequence,
then add a comment line, as below:

>pep23a from my library C1
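
The input rules above (a comment line first, then the sequence on any number of lines, all capital letters) can be sketched in Python as follows. This is an illustration only and not how Prosearch or ReadSeq parse files; the function name read_sequence and the short sequence in the example are hypothetical.

```python
def read_sequence(text):
    """Return (description, sequence) from text that starts with a '>' comment line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a comment line starting with '>'")
    description = lines[0][1:].strip()
    # The sequence may be spread over any number of lines; join them.
    sequence = "".join(lines[1:])
    if not (sequence.isalpha() and sequence.isupper()):
        raise ValueError("sequence must consist of capital letters only")
    return description, sequence

# Hypothetical two-line sequence under the comment line shown above.
print(read_sequence(">pep23a from my library C1\nMNGT\nSSEE\n"))
```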


	There are two possible forms of output.  The "short" form is a
table of accession numbers, positions in the sequence and short names
for patterns.  The "long" form is the same except that the relevant
sections from the Prosite Database are also printed.  At the HGMP-RC you
will be given the long form. 
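
To make the "short" form concrete, here is a minimal Python sketch that scans a sequence with a couple of regular expressions and prints a table of accession numbers, positions and short names. It is not the Prosearch implementation. The accession numbers, names and patterns are taken from the Prosite entries quoted in this help file, but the test sequence, the function names, the column widths and the 1-based inclusive position convention are all assumptions made for the example.

```python
import re

# Tiny stand-in for the pattern database: (accession, short name, regex).
PATTERNS = [
    ("PS00001", "ASN_GLYCOSYLATION", "N[^P][ST][^P]"),
    ("PS00006", "CK2_PHOSPHO_SITE", "[ST].{2}[DE]"),
]

def find_sites(sequence, patterns=PATTERNS):
    """Return (accession, start, end, name) rows, positions 1-based inclusive."""
    rows = []
    for accession, name, regex in patterns:
        # Wrapping the pattern in a lookahead also reports overlapping sites.
        for m in re.finditer("(?=(%s))" % regex, sequence):
            rows.append((accession, m.start(1) + 1, m.end(1), name))
    return rows

def short_form(rows):
    """Format the rows as a plain-text table."""
    lines = ["%-11s %-11s %s" % ("Access#", "From->To", "Name")]
    for accession, start, end, name in rows:
        span = "%d->%d" % (start, end)
        lines.append("%-11s %-11s %s" % (accession, span, name))
    return "\n".join(lines)

# Hypothetical eight-residue sequence, not a real protein.
print(short_form(find_sites("MNGTSSEE")))
```

The "long" form would follow the same table with the documentation text for each accession number that was found.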

Here is an example of the output for E. coli chloramphenicol
acetyltransferase III:

Prosite Database -- Release 5.0 of April 1990 Copyright: Amos Bairoch
ProSearch Software -- Release 1.1 -- Copyright: Lee Kolakowski
The following patterns are in < ct.pep >:

Access#     From->To    Name                    Doc#
_______     ________    ____________________    _________
PS00001         2->6    ASN_GLYCOSYLATION       PDOC00001
PS00005       31->34    PKC_PHOSPHO_SITE        PDOC00005
PS00006         4->8    CK2_PHOSPHO_SITE        PDOC00006
PS00006       32->36    CK2_PHOSPHO_SITE        PDOC00006
PS00006     102->106    CK2_PHOSPHO_SITE        PDOC00006
PS00006     113->117    CK2_PHOSPHO_SITE        PDOC00006
PS00100     178->184    CAT                     PDOC00093
PS00100     204->210    CAT                     PDOC00093
* N-glycosylation site *

It has been known for a long time [1] that potential N-glycosylation sites are
specific to the consensus sequence Asn-Xaa-Ser/Thr.  It must be noted that the
presence of the consensus  tripeptide  is  not sufficient  to conclude that an
asparagine residue is glycosylated, due to  the fact that the  folding of  the
protein plays an important  role in the  regulation of N-glycosylation [2].  A
recent study [3] has  shown that the presence of a proline  either between the
Asn  and the Ser/Thr or  C-terminal to  the Ser/Thr  will  completely suppress

-Consensus pattern: N-{P}-[ST]-{P}
                    [N is the glycosylation site]
-Last update: June 1988 / First entry.

[ 1] Marshall R.D.
     Annu. Rev. Biochem. 41:673-702(1972).
[ 2] Pless D.D., Lennarz W.J.
     Proc. Natl. Acad. Sci. U.S.A. 74:134-138(1977).
[ 3] Bause E.
     Biochem. J. 209:331-336(1983).
* Protein kinase C phosphorylation site *

In vivo, protein kinase C  exhibits  a  preference  for the phosphorylation of
serine or threonine residues  close to a  C-terminal basic residue [1,2].  The
presence of additional  basic residues at the  N- or C-terminal of  the target
amino acid enhances the Vmax and Km of the phosphorylation reaction.

-Consensus pattern: [ST]-x-[RK]
                    [S or T is the phosphorylation site]
-Last update: June 1988 / First entry.

[ 1] Woodgett J.R., Gould K.L., Hunter T.
     Eur. J. Biochem. 161:177-184(1986).
[ 2] Kishimoto A., Nishiyama K., Nakanishi H., Uratsuji Y., Nomura H.,
     Takeyama Y., Nishizuka Y.
     J. Biol. Chem. 260:12492-12499(1985).
* Casein kinase II phosphorylation site *

Casein kinase II (CK-2) is a protein serine/threonine kinase that has activity
independent of cyclic nucleotides and of calcium.   This enzyme phosphorylates
many different proteins.  The substrate  specificity of this  enzyme [1,2] can
be summarized as follows:

    (1) Under comparable conditions Ser is favoured over Thr.
    (2) An acidic residue  (either Asp or Glu)  must be present three residues
        to the C-terminal of the phosphate acceptor site.
    (3) Additional acidic residues in positions +1, +2, +4 and +5 increase the
        phosphorylation rate.  Most physiological substrates have at least one
        acidic residue in these positions.
    (4) Asp is preferred to Glu as the provider of acidic determinants.
    (5) A basic residue to the N-terminal  of the  acceptor site decreases the
        phosphorylation rate, while an acidic one will increase it.

-Consensus pattern: [ST]-x(2)-[DE]
                    [S or T is the phosphorylation site]
-Note: this pattern  is found  in all  of the  known  physiological substrates
 except in the high mobility group protein 14,  where  an alanine replaces the
 acidic  residue  in  position +3.   However, the phosphorylation rate of this
 substrate is very low.
-Last update: January 1989 / First entry.

[ 1] Marin O., Meggio F., Marchiori F., Borin G., Pinna L.A.
     Eur. J. Biochem. 160:239-244(1986).
[ 2] Kuenzel E.A., Mulligan J.A., Sommercorn J., Krebs E.G.
     J. Biol. Chem. 262:9136-9140(1987).
{PS00100; CAT}
* Chloramphenicol acetyltransferase active site *

Chloramphenicol acetyltransferase (CAT) (EC catalyzes the Acetyl-COA
dependent acetylation  of the antibiotic chloramphenicol  [1], an inhibitor of
prokaryotic peptidyltransferase  activity.  Acetylation  of chloramphenicol by
CAT inactivates  the antibiotic.  An histidine residue plays a central role in
the catalytic mechanism of the enzyme. We use a conserved hexapeptide sequence
around the catalytic residue as a signature pattern for this type of enzyme.

-Consensus pattern: H-H-x-V-C-D
                    [The second H is the active site residue]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Last update: January 1989 / First entry.

[ 1] Murray I.A., Hawkins A.R., Keyte J.W., Shaw W.V.
     Biochem. J. 252:173-179(1988).


	This code is covered by the Free Software Foundation's GNU
General Public License.

Frank Kolakowski 

|lfk at                     ||      Lee F. Kolakowski    |
|lfk at                   ||	M.I.T.		     |
|kolakowski at                ||	Dept of Chemistry    |
|lfk at		        ||	Room 18-506	     |
|lfk at                     ||	77 Massachusetts Ave.|
|AT&T:  1-617-253-1866                  ||	Cambridge, MA 02139  |


