Searching proteins by their digested mass profile
cbrg at inf.ethz.ch
Sat Feb 6 16:05:09 EST 1993
A new server is available from the CBRG in Zurich. A description
of the server follows. This description can be obtained by
sending e-mail to cbrg at inf.ethz.ch with one line "help MassSearch".
For a description of other servers at the CBRG, send an e-mail with
the line: "help all"
In some cases, a protein can be recognized by fragmenting it
according to a certain pattern and using the molecular weights of
the fragments as a trace. This method is not effective for finding
the composition of an unknown protein, but it is effective for
locating an unknown sample if its sequence is recorded in a
database.
One of the ways of breaking a protein into smaller pieces
according to a certain pattern is by using enzymes which digest
the protein. For example, trypsin breaks a protein after every
Arginine (R) or after every Lysine (K) not followed by a Proline
(P). AspN breaks a protein before every Aspartic acid (D). A table
of recognized enzymes and their cleavage rules is given below.
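
The two cleavage rules just quoted can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the text, not the server's code; the function names trypsin_digest and aspn_digest are made up for this example, and sequences are assumed to be strings of one-letter amino acid codes.

```python
def trypsin_digest(seq):
    """Cut after every R, and after every K that is not followed by P."""
    fragments, start = [], 0
    for i, residue in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if residue == "R" or (residue == "K" and nxt != "P"):
            fragments.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        # Whatever follows the last cut is the final fragment.
        fragments.append(seq[start:])
    return fragments


def aspn_digest(seq):
    """Cut before every D."""
    fragments, start = [], 0
    for i, residue in enumerate(seq):
        if residue == "D" and i > start:
            fragments.append(seq[start:i])
            start = i
    if start < len(seq):
        fragments.append(seq[start:])
    return fragments


print(trypsin_digest("AKRPGKPLRD"))
print(aspn_digest("MDAADK"))
```

Note that in the first call the K followed by P is left uncut, while the R followed by P is cut, as the rule in the text is worded.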
The molecular weights of the fragments can be found experimentally
by mass spectrometry to a good level of accuracy. More
importantly, these methods typically require very small samples,
on the order of fractions of a picomole.
The problem of identifying a sampled protein can be reduced to
digesting the protein with an enzyme, finding the molecular
weights of each of the pieces and then comparing this set of
weights to what would be obtained from the digestion of each
protein in the database. The process can be repeated with several
different enzymes to increase its selectivity.
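
As a rough picture of the comparison step, the sketch below computes theoretical fragment weights and counts how many observed weights fall close to one of them. Several things here are assumptions and not taken from the server: the residue masses are standard average values in daltons, the fragments are treated as neutral peptides (residues plus one water), the 1.0 Da tolerance is arbitrary, and the names fragment_mass and count_matches are hypothetical. The actual server also derives a score, which this sketch does not attempt.

```python
# Average residue masses in daltons (standard textbook values, an assumption
# of this sketch; the post does not say which mass table the server uses).
RESIDUE_MASS = {
    "G": 57.052, "A": 71.079, "S": 87.078, "P": 97.117, "V": 99.133,
    "T": 101.105, "C": 103.139, "L": 113.159, "I": 113.159, "N": 114.104,
    "D": 115.089, "Q": 128.131, "K": 128.174, "E": 129.116, "M": 131.193,
    "H": 137.141, "F": 147.177, "R": 156.188, "Y": 163.176, "W": 186.213,
}
WATER = 18.015  # one water is added per free peptide


def fragment_mass(fragment):
    """Weight of a peptide fragment: sum of its residues plus one water."""
    return sum(RESIDUE_MASS[r] for r in fragment) + WATER


def count_matches(observed, theoretical, tolerance=1.0):
    """Count observed weights within `tolerance` of a not-yet-used theoretical weight."""
    remaining = sorted(theoretical)
    matched = 0
    for weight in sorted(observed):
        for i, candidate in enumerate(remaining):
            if abs(weight - candidate) <= tolerance:
                matched += 1
                del remaining[i]  # each theoretical weight is used at most once
                break
    return matched


# Fragments of one hypothetical database entry after digestion.
fragments = ["AK", "R", "PGKPLR", "D"]
theoretical = [fragment_mass(f) for f in fragments]
observed = [217.3, 174.2, 900.0]
print(len(fragments), count_matches(observed, theoretical))
```

Repeating this for every database entry, and for each enzyme used, gives the raw counts from which a ranking could be built.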
The function MassSearch locates the best candidates in a protein
database (SwissProt at this time) that would fit the given weights
once digested by the given enzyme.
This type of searching has been found particularly useful in the
following cases:
o To identify proteins when the amount available is very small,
for example the quantities that can be separated by 2D gels.
o To determine whether an unknown protein is already known in the
database before spending a significant effort in sequencing.
o To identify more than one protein which cannot be separated by
other means (this method has been successfully used to identify
two proteins which were digested together).
The template of the body of the message to be sent to
cbrg at inf.ethz.ch is (between but not including the dashed lines):
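------------------------------------------------------------
MassSearch
Trypsin: 1524.0, 1509.7, 1387.5, 1169.4, 1014.4, 842.5,
836.4, 743.2, 717.2, 563.1, 511.3
------------------------------------------------------------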
The token "MassSearch" indicates the operation to be run. The
following lines contain the name of the digester enzyme followed
by the weights. The weights can be separated by spaces, commas,
tabs or newlines as convenient, but no other characters may
appear among them. Several searches can be requested in a single
command; each request is identified by the name of the enzyme,
followed by its weights.
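
For anyone preparing requests by program, the format just described can be checked with a small parser before the message is sent. This is a sketch under assumptions: it takes the token MassSearch to sit on its own first line ahead of the enzyme lines, as the description suggests, and the function name parse_request is hypothetical. It says nothing about how the server itself reads the message.

```python
import re


def parse_request(body):
    """Return a list of (enzyme, [weights]) pairs from a MassSearch message body."""
    lines = body.strip().splitlines()
    if not lines or lines[0].strip() != "MassSearch":
        raise ValueError("body must start with the MassSearch token")
    text = " ".join(lines[1:])
    requests = []
    # Each request is an enzyme name, a colon, then everything up to the
    # next letter (that is, up to the next enzyme name or the end).
    for match in re.finditer(r"([A-Za-z]\w*)\s*:\s*([^A-Za-z]*)", text):
        enzyme, numbers = match.groups()
        # Weights may be separated by spaces, commas, tabs or newlines.
        weights = [float(x) for x in re.split(r"[\s,]+", numbers.strip()) if x]
        requests.append((enzyme, weights))
    return requests


body = """MassSearch
Trypsin: 1524.0, 1509.7, 1387.5, 1169.4, 1014.4, 842.5,
836.4, 743.2, 717.2, 563.1, 511.3
"""
for enzyme, weights in parse_request(body):
    print(enzyme, len(weights), "weights")
```

Any stray character among the weights makes float() raise a ValueError, which mirrors the rule that nothing else may appear there.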
The output of the above request is:
Searching on SwissProt version 23. For each set of
weights, the matching sequences are printed in decreasing
order of significance. Scores lower than 70 are generally
Searching the weights 1524, 1509.7000, 1387.5000, 1169.4000, 1014.4000,
842.5000 , 836.4000, 743.2000, 717.2000, 563.1000, 511.3000 as
digested by Trypsin
 Score   n  k  AC       DE  OS
 159.4  15  9  P80049;  FATTY ACID-BINDING PROTEIN, LIVER (FABP). GINGLYMOSTOMA
                        CIRRATUM (NURSE SHARK).
  76.2  28  5  P22966;  ANGIOTENSIN-CONVERTING ENZYME PRECURSOR, TESTIS-SPECIFIC (EC
                        3.4.15.1) (ACE) (DIPEPTIDYL CARBOXYPEPTIDASE I) (KININASE
                        II). HOMO SAPIENS (HUMAN).
  72.4  11  4  P16291;  COAGULATION FACTOR IX (EC 3.4.21.22) (CHRISTMAS FACTOR)
                        (FRAGMENT). OVIS ARIES (SHEEP).
  72.3  25  2  P18416;  TRANSPOSASE (TRANSPOSON TN552) (ORF 480). STAPHYLOCOCCUS
  71.0   5  6  P08821;  DNA-BINDING PROTEIN II (HB) (HU). BACILLUS SUBTILIS, AND
  66.9  23  7  P13214;  ANNEXIN IV (LIPOCORTIN IV) (ENDONEXIN I) (CHROMOBINDIN 4)
                        (PROTEIN II) (P32.5) (PLACENTAL ANTICOAGULANT PROTEIN II)
                        (PAP-II) (PP4-X) (35-BETA CALCIMEDIN). BOS TAURUS (BOVINE).
. . . . .
The first column measures the quality of the match between the
given weights and a protein sequence in the database. The higher
the score, the better the match. The hits are listed in decreasing
scoring order. The second column, identified by n, indicates the
number of fragments that will result from the digestion of the
found protein. The third column, identified with k, indicates the
number of given weights which were successfully matched against
the theoretical digestion. The score is calculated from the total
number of fragments, the number of given weights matched, and from
how closely these weights could be matched. The fourth column
indicates the accession number of the sequence in SwissProt. The
rest of each line contains the description and species of the
sequence which serve as a quick guide to identify the protein.
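
A reply in this layout can be reduced to its significant hits with a short script. The sketch below is based only on the sample output shown in this message: it assumes each hit starts with a score, n, k and an accession number ending in a semicolon, and it ignores the wrapped continuation lines of the description. The name parse_hits is hypothetical, and the default threshold of 70 is the boundary quoted in this description.

```python
import re

# score, n, k, accession number (terminated by ';'), start of the description
HIT_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+(\d+)\s+(\d+)\s+(\w+);\s*(.*)$")


def parse_hits(report, threshold=70.0):
    """Return the hits whose score is at least `threshold`, in report order."""
    hits = []
    for line in report.splitlines():
        match = HIT_LINE.match(line)
        if match is None:
            continue  # header, wrapped description line, or other text
        score = float(match.group(1))
        if score >= threshold:
            hits.append({
                "score": score,
                "n": int(match.group(2)),   # fragments from the digestion
                "k": int(match.group(3)),   # given weights that were matched
                "ac": match.group(4),       # SwissProt accession number
                "description": match.group(5).strip(),
            })
    return hits


report = """ 159.4 15 9 P80049; FATTY ACID-BINDING PROTEIN, LIVER (FABP). GINGLYMOSTOMA
 CIRRATUM (NURSE SHARK).
 66.9 23 7 P13214; ANNEXIN IV (LIPOCORTIN IV) (ENDONEXIN I) (CHROMOBINDIN 4)
"""
for hit in parse_hits(report):
    print(hit["ac"], hit["score"], hit["n"], hit["k"])
```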
A complete description of the algorithm and the probability
foundations can be found in chapter 20 of "A tutorial introduction
to computational biochemistry using the Darwin system" by G.H.
The boundary between insignificant and significant matches is
around 70. Scores less than 70 are not very significant, while
scores greater than 70 are significant.
The enzymes which are presently recognized, and the names to be
used, are the following (courtesy of Amos Bairoch):
Enzyme name         cuts between              except for
###########         ############              ##########
Chymotrypsin        Trp-Xaa,Phe-Xaa,Tyr-Xaa,  Trp-Pro,Phe-Pro,Tyr-Pro,
PostProline         Pro-Xaa                   Pro-Pro
TrypsinArgBlocked   Lys-Xaa                   Lys-Pro
TrypsinCysModified  Arg-Xaa,Lys-Xaa,Cys-Xaa   Arg-Pro,Lys-Pro,Cys-Pro
TrypsinLysBlocked   Arg-Xaa                   Arg-Pro
Trypsin             Arg-Xaa,Lys-Xaa           Lys-Pro
V8AmmoniumAcetate   Glu-Xaa                   Glu-Pro
V8PhosphateBuffer   Asp-Xaa,Glu-Xaa           Asp-Pro,Glu-Pro
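
The table lends itself to a data-driven digestion routine, sketched below. The encoding is an assumption of this example: each rule is written with one-letter codes as the residues after which the enzyme cuts, plus the two-residue pairs that are exempt. Only rows that appear complete in the table are included (the Chymotrypsin row seems to continue beyond what is shown), and the names RULES and digest are hypothetical.

```python
# enzyme name -> (residues after which it cuts, residue pairs it does not cut)
RULES = {
    "PostProline":        ("P",   {"PP"}),
    "TrypsinArgBlocked":  ("K",   {"KP"}),
    "TrypsinCysModified": ("RKC", {"RP", "KP", "CP"}),
    "TrypsinLysBlocked":  ("R",   {"RP"}),
    "Trypsin":            ("RK",  {"KP"}),
    "V8AmmoniumAcetate":  ("E",   {"EP"}),
    "V8PhosphateBuffer":  ("DE",  {"DP", "EP"}),
}


def digest(sequence, enzyme):
    """Split `sequence` between residue pairs according to the enzyme's rule."""
    cut_after, exceptions = RULES[enzyme]
    fragments, start = [], 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair[0] in cut_after and pair not in exceptions:
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])
    return fragments


print(digest("AKRPGKPLRD", "Trypsin"))
print(digest("AKRPGKPLRD", "TrypsinLysBlocked"))
```

Keeping the rules as data means a new enzyme only needs a new table entry, provided it follows the same "cuts between, except for" pattern.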
Please report any problems with the server to
knecht at inf.ethz.ch