Searching proteins by their digested mass profile

CompBioResGrp cbrg at inf.ethz.ch
Sat Feb 6 16:05:09 EST 1993


A new server is available from the CBRG in Zurich.  A description
of the server follows.  This description can be obtained by
sending e-mail to cbrg at inf.ethz.ch with one line "help MassSearch".

For a description of other servers at the CBRG, send an e-mail with
the line: "help all"

------------------------------------------------------------------
In some cases, a protein can be recognized by fragmenting it
according to a certain pattern and using the molecular weights of
the fragments as a trace.  This method is not effective for
determining the composition of an unknown protein, but it is
effective in identifying an unknown sample if its sequence is
recorded in a protein database.

One  of  the  ways  of  breaking  a  protein  into  smaller pieces
according  to  a  certain pattern is by using enzymes which digest
the  protein.  For  example,  trypsin breaks a protein after every
Arginine  (R)  or after every Lysine (K) not followed by a Proline
(P). AspN breaks a protein before every Aspartic acid (D). A table
of recognized enzymes and their cleavage rules is given below.
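
The two rules quoted above can be sketched in a few lines of
Python.  This is only an illustration of the cleavage rules as
stated in this description, not the server's code; the function
names digest_trypsin and digest_aspn are hypothetical, and
sequences are assumed to be given in one-letter amino acid codes.

```python
def digest_trypsin(sequence):
    """Cut after every R, and after every K not followed by P."""
    fragments = []
    start = 0
    for i, residue in enumerate(sequence):
        is_last = i + 1 == len(sequence)
        following = "" if is_last else sequence[i + 1]
        cut = residue == "R" or (residue == "K" and following != "P")
        if cut and not is_last:
            fragments.append(sequence[start:i + 1])
            start = i + 1
    fragments.append(sequence[start:])
    return fragments


def digest_aspn(sequence):
    """Cut before every D (except a D at the very start)."""
    fragments = []
    start = 0
    for i in range(1, len(sequence)):
        if sequence[i] == "D":
            fragments.append(sequence[start:i])
            start = i
    fragments.append(sequence[start:])
    return fragments


print(digest_trypsin("AKRPGKPLR"))
print(digest_aspn("MADKDE"))
```

In the first call the K at position 6 is followed by P, so no cut
is made there.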

The molecular weights of the fragments can be found experimentally
by mass spectrometry to a good level of accuracy.  More
importantly, these methods typically require very small samples,
on the order of fractions of a picomole.

The  problem  of  identifying  a sampled protein can be reduced to
digesting  the  protein  with  an  enzyme,  finding  the molecular
weights  of  each  of  the  pieces  and then comparing this set of
weights  to  what  would  be  obtained  from the digestion of each
protein in the database.  The process can be repeated with several
different enzymes to increase its selectivity.
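
The comparison step can be sketched as follows.  This is a
simplified illustration, not the scoring used by the server: it
only counts how many observed weights fall within a fixed
tolerance of a theoretical fragment weight.  The table holds the
standard average residue masses; whether the server uses average
or monoisotopic masses is not stated here, so that choice, the
1.0 tolerance and the function names are all assumptions.

```python
# Standard average residue masses (in daltons), one-letter codes.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # one water is added per free peptide


def fragment_mass(fragment):
    """Average mass of a peptide given in one-letter codes."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in fragment) + WATER


def count_matches(observed, theoretical, tolerance=1.0):
    """Count observed weights that pair with a distinct theoretical
    weight within `tolerance` (greedy, closest candidate first)."""
    remaining = sorted(theoretical)
    matched = 0
    for weight in sorted(observed):
        candidates = [t for t in remaining if abs(t - weight) <= tolerance]
        if candidates:
            closest = min(candidates, key=lambda t: abs(t - weight))
            remaining.remove(closest)
            matched += 1
    return matched


theoretical = [fragment_mass(f) for f in ["AK", "R", "PGKPLR"]]
print(count_matches([217.3, 174.2, 900.0], theoretical))
```

Repeating the count for every database entry and ranking the
entries would give a crude version of the search; the server
instead reports a score, described further below.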

The  function  MassSearch locates the best candidates in a protein
database (SwissProt at this time) that would fit the given weights
once digested by the given enzyme.

This  type  of searching has been found particularly useful in the
following circumstances:

o  To identify proteins when the amount available is very small,
   for example the amount that can be separated by 2D gels.
o  To determine whether an unknown protein is already known in the
   database before spending a significant effort in sequencing.
o  To  identify more than one protein which cannot be separated by
   other means (this method has been successfully used to identify
   two proteins which were digested together).

The   template   of  the  body  of  the  message  to  be  sent  to
cbrg at inf.ethz.ch is (between but not including the dashed lines):

---------------------------------------------------------------------
MassSearch
Trypsin: 1524.0, 1509.7, 1387.5, 1169.4, 1014.4, 842.5,
          836.4,  743.2,  717.2,  563.1,  511.3
---------------------------------------------------------------------

The  token  "MassSearch"  indicates  the operation to be run.  The
following  lines  contain the name of the digester enzyme followed
by  the  weights.  The weights can be separated by spaces, commas,
tabs   or   newlines   as  convenient,  but  no  other  extraneous
characters.  Many different searches can be requested in a single
command; each request must be identified by the name of the enzyme,
followed by the weights.
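
To make the format concrete, here is a minimal sketch of how such
a message body could be read.  It is not the server's parser: the
function name parse_requests is hypothetical, and it assumes that
each enzyme name ends with a colon, as in the template above.

```python
import re


def parse_requests(body):
    """Return a list of (enzyme, [weights]) pairs from a message body."""
    lines = body.strip().splitlines()
    if not lines or lines[0].strip() != "MassSearch":
        raise ValueError("the first line must be the token MassSearch")
    # Make sure a colon always ends its token, even without a space.
    text = "\n".join(lines[1:]).replace(":", ": ")
    requests = []
    # Weights may be separated by spaces, commas, tabs or newlines.
    for token in re.split(r"[\s,]+", text):
        if not token:
            continue
        if token.endswith(":"):
            requests.append((token[:-1], []))
        elif not requests:
            raise ValueError("weights given before any enzyme name")
        else:
            # float() rejects any other extraneous characters.
            requests[-1][1].append(float(token))
    return requests


message = """MassSearch
Trypsin: 1524.0, 1509.7, 1387.5, 1169.4, 1014.4, 842.5,
          836.4,  743.2,  717.2,  563.1,  511.3
"""
for enzyme, weights in parse_requests(message):
    print(enzyme, len(weights), weights[0], weights[-1])
```

A second request in the same message would simply start with
another enzyme name and produce a second pair in the list.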

The output of the above request is:

   Searching on SwissProt version 23.  For each set of
weights, the matching sequences are printed in decreasing
order of significance.  Scores lower than 70 are generally
not significant.

Searching the weights 1524, 1509.7000, 1387.5000, 1169.4000, 1014.4000,
   842.5000 , 836.4000, 743.2000, 717.2000, 563.1000, 511.3000 as
   digested by Trypsin

Score  n k   AC      DE                   OS
159.4 15 9 P80049; FATTY ACID-BINDING PROTEIN, LIVER (FABP). GINGLYMOSTOMA
                    CIRRATUM (NURSE SHARK).
 76.2 28 5 P22966; ANGIOTENSIN-CONVERTING ENZYME PRECURSOR, TESTIS-SPECIFIC (EC
                    3.4.15.1) (ACE) (DIPEPTIDYL CARBOXYPEPTIDASE I) (KININASE
                    II). HOMO SAPIENS (HUMAN).
 72.4 11 4 P16291; COAGULATION FACTOR IX (EC 3.4.21.22) (CHRISTMAS FACTOR)
                    (FRAGMENT). OVIS ARIES (SHEEP).
 72.3 25 2 P18416; TRANSPOSASE (TRANSPOSON TN552) (ORF 480). STAPHYLOCOCCUS
                    AUREUS.
 71.0  5 6 P08821; DNA-BINDING PROTEIN II (HB) (HU). BACILLUS SUBTILIS, AND
                    BACILLUS GLOBIGII.
 66.9 23 7 P13214; ANNEXIN IV (LIPOCORTIN IV) (ENDONEXIN I) (CHROMOBINDIN 4)
                    (PROTEIN II) (P32.5) (PLACENTAL ANTICOAGULANT PROTEIN II)
                    (PAP-II) (PP4-X) (35-BETA CALCIMEDIN). BOS TAURUS (BOVINE).
 . . . . .

The  first  column  measures  the quality of the match between the
given  weights and a protein sequence in the database.  The higher
the score, the better the match. The hits are listed in decreasing
scoring  order.  The second column, identified by n, indicates the
number  of  fragments  that  will result from the digestion of the
found protein.  The third column, identified with k, indicates the
number  of  given  weights which were successfully matched against
the theoretical digestion.  The score is calculated from the total
number of fragments, the number of given weights matched, and from
how  closely  these  weights  could be matched.  The fourth column
indicates  the accession number of the sequence in SwissProt.  The
rest  of  each  line  contains  the description and species of the
sequence which serve as a quick guide to identify the protein.

A  complete  description  of  the  algorithm  and  the probability
foundations can be found in chapter 20 of "A tutorial introduction
to  computational  biochemistry  using  the Darwin system" by G.H.
Gonnet.

The boundary between insignificant and significant matches lies
around a score of 70: scores below 70 are generally not
significant, while scores above 70 are.  In the sample output
above, the first five hits score above 70 and the sixth (66.9)
falls below.
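
For readers who want to post-process a reply, the columns and the
threshold of 70 described above can be handled as in the sketch
below.  The layout is inferred only from the sample output in this
description (a row starts with score, n, k and an accession number
ending in a semicolon; indented lines continue the description),
so the pattern and the function names are assumptions.

```python
import re

# score, n, k, accession number terminated by ';', then description
HIT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+(\d+)\s+(\d+)\s+(\w+);\s*(.*)$")


def parse_hits(text):
    """Turn the result rows of a reply into a list of dictionaries."""
    hits = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and the ". . . . ." truncation marker.
        if not stripped or set(stripped) <= {".", " "}:
            continue
        match = HIT.match(line)
        if match:
            hits.append({
                "score": float(match.group(1)),
                "n": int(match.group(2)),
                "k": int(match.group(3)),
                "ac": match.group(4),
                "description": match.group(5).strip(),
            })
        elif hits:
            # Continuation line of the previous description.
            hits[-1]["description"] += " " + stripped
    return hits


def significant(hits, threshold=70.0):
    """Keep only hits at or above the significance boundary."""
    return [hit for hit in hits if hit["score"] >= threshold]


reply = """Score  n k   AC      DE                   OS
159.4 15 9 P80049; FATTY ACID-BINDING PROTEIN, LIVER (FABP). GINGLYMOSTOMA
                    CIRRATUM (NURSE SHARK).
 66.9 23 7 P13214; ANNEXIN IV (LIPOCORTIN IV) (ENDONEXIN I) (CHROMOBINDIN 4)
                    (PROTEIN II) (P32.5) (PLACENTAL ANTICOAGULANT PROTEIN II)
                    (PAP-II) (PP4-X) (35-BETA CALCIMEDIN). BOS TAURUS (BOVINE).
 . . . . .
"""
for hit in significant(parse_hits(reply)):
    print(hit["score"], hit["n"], hit["k"], hit["ac"])
```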

The enzymes which are presently recognized, and the names to be
used, are the following (courtesy of Amos Bairoch):

Enzyme name           cuts between                    except for
###########           ############                    ##########

Armillaria            Xaa-Cys,Xaa-Lys
ArmillariaMellea      Xaa-Lys
BNPS_NCS              Trp-Xaa
Chymotrypsin          Trp-Xaa,Phe-Xaa,Tyr-Xaa,        Trp-Pro,Phe-Pro,Tyr-Pro,
                      Met-Xaa,Leu-Xaa,                Met-Pro,Leu-Pro
Clostripain           Arg-Xaa
CNBr_Cys              Met-Xaa,Xaa-Cys
CNBr                  Met-Xaa
AspN                  Xaa-Asp
LysC                  Lys-Xaa
Hydroxylamine         Asn-Gly
MildAcidHydrolysis    Asp-Pro
NBS_long              Trp-Xaa,Tyr-Xaa,His-Xaa
NBS_short             Trp-Xaa,Tyr-Xaa
NTCB                  Xaa-Cys
PancreaticElastase    Ala-Xaa,Gly-Xaa,Ser-Xaa,Val-Xaa
PapayaProteinaseIV    Gly-Xaa
PostProline           Pro-Xaa                         Pro-Pro
Thermolysin           Xaa-Leu,Xaa-Ile,Xaa-Met,
                      Xaa-Phe,Xaa-Trp,Xaa-Val
TrypsinArgBlocked     Lys-Xaa                         Lys-Pro
TrypsinCysModified    Arg-Xaa,Lys-Xaa,Cys-Xaa         Arg-Pro,Lys-Pro,Cys-Pro
TrypsinLysBlocked     Arg-Xaa                         Arg-Pro
Trypsin               Arg-Xaa,Lys-Xaa                 Lys-Pro
V8AmmoniumAcetate     Glu-Xaa                         Glu-Pro
V8PhosphateBuffer     Asp-Xaa,Glu-Xaa                 Asp-Pro,Glu-Pro
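
The table can be read mechanically: each entry names the pair of
residues between which the enzyme cuts, Xaa stands for any
residue, and the third column lists pairs that are never cut.  The
sketch below applies a few rows of the table in that way.  It is
an illustration under that reading of the table, not the server's
implementation; the names ENZYMES, parse_pairs and digest are
hypothetical.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# A few rows copied from the table: (cuts between, except for).
ENZYMES = {
    "Trypsin": ("Arg-Xaa,Lys-Xaa", "Lys-Pro"),
    "AspN": ("Xaa-Asp", ""),
    "PostProline": ("Pro-Xaa", "Pro-Pro"),
    "V8PhosphateBuffer": ("Asp-Xaa,Glu-Xaa", "Asp-Pro,Glu-Pro"),
}


def parse_pairs(spec):
    """'Arg-Xaa,Lys-Xaa' -> [('R', None), ('K', None)]; None is Xaa."""
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if item:
            left, right = item.split("-")
            pairs.append(tuple(
                None if name == "Xaa" else THREE_TO_ONE[name]
                for name in (left, right)))
    return pairs


def matches(pairs, left, right):
    """True if the residue pair (left, right) fits any listed pair."""
    return any(p[0] in (None, left) and p[1] in (None, right)
               for p in pairs)


def digest(sequence, enzyme):
    """Cut a one-letter sequence according to one row of the table."""
    cuts, exceptions = (parse_pairs(s) for s in ENZYMES[enzyme])
    fragments = []
    start = 0
    for i in range(1, len(sequence)):
        left, right = sequence[i - 1], sequence[i]
        if matches(cuts, left, right) and not matches(exceptions, left, right):
            fragments.append(sequence[start:i])
            start = i
    fragments.append(sequence[start:])
    return fragments


print(digest("AKRPGKPLR", "Trypsin"))
print(digest("APPGA", "PostProline"))
```

Keeping the rules as data means the remaining rows of the table
could be added without changing the digest function.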

Please report any problems with the server to

	knecht at inf.ethz.ch


