A peptide mass fingerprint email server is now available at
mowse at dl.ac.uk
The help file, available from this address, is reproduced below and
describes the database and how to access it. The server identifies
known proteins from a set of peptide molecular weights, determined by
mass spectrometry after proteolytic digestion.
Alan Bleasby
SERC Daresbury Laboratory
********************************
The MOWSE peptide mass database:
********************************
Imperial Cancer Research Fund
and
SERC Daresbury Laboratory
D.J.C. Pappin, P. Hojrup and A.J. Bleasby
'Rapid Identification of Proteins by
Peptide-Mass Fingerprinting'.
Current Biology (1993), vol 3, 327-332.
InterNet server version:
Table of Contents:
[1] Background.
[2] Construction of the MOWSE database.
[2.1] Source database.
[2.2] Calculation of Molecular weight fragments.
[3] Running database searches via e_mail.
[4] Example of mail query format.
[5] Results listing.
[6] Database structure.
[6.1] MOWSE database structure.
[6.2] The MW primary fragment molecular weight file.
[6.3] The MDX file OWL entry index.
[6.4] The SMW whole sequence molecular weight file.
[6.5] Program Requirements.
[6.6] MOWSE Scoring scheme.
[6.7] Simulation studies.
[7] General references.
[1] Background:
Determination of molecular weight has always been an
important aspect of the characterization of biological molecules.
Protein molecular weight data, historically obtained by SDS gel
electrophoresis or gel permeation chromatography, has been used to
establish purity, detect post-translational modification (such as
phosphorylation or glycosylation) and aid identification. Until
just over a decade ago, mass spectrometric techniques were typically
limited to relatively small biomolecules, as proteins and nucleic
acids were too large and fragile to withstand the harsh physical
processes required to induce ionization. This began to change with
the development of 'soft' ionization methods such as fast atom
bombardment (FAB)[1], electrospray ionisation (ESI) [2,3] and
matrix-assisted laser desorption ionisation (MALDI)[4], which can
effect the efficient transition of large macromolecules from
solution or solid crystalline state into intact, naked molecular
ions in the gas phase. As an added bonus to the protein chemist,
sample handling requirements are minimal and the amounts required
for MS analysis are in the same range as, or less than, those needed
by existing analytical methods.
As well as providing accurate mass information for intact
proteins, such techniques have been routinely used to produce
accurate peptide molecular weight 'fingerprint' maps following
digestion of known proteins with specific proteases. Such maps
have been used to confirm protein sequences (allowing the
detection of errors of translation, mutation or insertion),
characterise post-translational modifications or processing events
and assign disulphide bonds [5,6].
Less well appreciated, however, is the extent to which such
peptide mass information can provide a 'fingerprint' signature
sufficiently discriminating to allow for the unique and rapid
identification of unknown sample proteins, independent of other
analytical methods such as protein sequence analysis.
The following text describes the construction and use
of the MOWSE peptide mass database (for MOlecular Weight SEarch)
at the SERC Daresbury Laboratory. Practical experience has shown
that sample proteins can be uniquely identified using as few as 3-4
experimentally determined peptide masses when screened against a
fragment database derived from over 50,000 proteins. Experimental
errors of a few Daltons are tolerated by the scoring algorithms,
permitting the use of inexpensive time-of-flight mass
spectrometers. As with other types of physical data, such as amino
acid composition or linear sequence, peptide masses can clearly
provide a set of determinants sufficiently unique to identify or
match unknown sample proteins. Peptide mass fingerprints can prove
as discriminating as linear peptide sequence, but can be obtained
in a fraction of the time using less material. In many cases, this
allows for a rapid identification of a sample protein before
committing to protein sequence analysis. Fragment masses also
provide structural information, at the protein level, fully
complementary to large-scale DNA sequencing or mapping projects
[7,8,9].
[2] Construction of the MOWSE database.
[2.1] Source database.
MOWSE was created from the OWL non-redundant composite
protein sequence database [10,11]. The first InterNet release (version
20.1) contains some 61,000 protein entries, generating approximately
15,000,000 peptide fragments. The MOWSE fragment database will be updated
with each new release of the parent OWL database (every 2 months or so).
[2.2] Calculation of Molecular weight fragments.
For each entry in the source OWL database, MOWSE derives both
whole sequence molecular weight and calculated peptide molecular
weights for complete digests using the range of cleavage reagents
and rules detailed in Table 1. Cleavage is disallowed if the
target residue is followed by proline (except for CNBr or Asp-N).
Glu-C (S. aureus V8 protease) cleavages are also inhibited if the
adjacent residue is glutamic acid. Peptide mass calculations are
based entirely on the linear sequence and use the average isotopic
masses of amide-bonded amino acid residues (IUPAC 1987 relative
atomic masses). To allow for N-terminal hydrogen and C-terminal
hydroxyl the final calculated molecular weight of a peptide of N
residues is given by the equation:
\[
  \text{MW} = \sum_{n=1}^{N} (\text{residue mass})_n + 18.0153
\]
Molecular weights are rounded to the nearest integer value
before being entered into the database. Cysteine residues are
calculated as the free thiol, anticipating that samples are
reduced prior to mass analysis. CNBr fragments are calculated as
the homoserine lactone form. Information relating to post-
translational modification (phosphorylation, glycosylation etc.)
is not incorporated into calculation of peptide masses.
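
The calculation described above can be sketched in a few lines of Python.
This is an illustration only, not the MOWSE source: the residue masses are
commonly tabulated average values and are an assumption (the help file
names IUPAC 1987 atomic masses but does not list residue masses), and the
function names are hypothetical. The homoserine lactone adjustment for
CNBr fragments is not modelled.

```python
# Average residue masses in Daltons. Assumed values: the help file does not
# list the figures MOWSE actually uses.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # N-terminal hydrogen plus C-terminal hydroxyl


def peptide_mass(peptide):
    """Return the unrounded average mass of a linear, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide.upper()) + WATER


def database_mass(peptide):
    """Round to the nearest integer, as done before database entry."""
    return int(peptide_mass(peptide) + 0.5)


print(database_mass("GG"))  # 132
```

Cysteine is taken as the free thiol here simply because the table holds
the unmodified residue mass.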
Reagent no.   Reagent         Cleavage rule
1             Trypsin         C-term to K/R
2             Lys-C           C-term to K
3             Arg-C           C-term to R
4             Asp-N           N-term to D
5             V8-bicarb       C-term to E
6             V8-phosph       C-term to E/D
7             Chymotrypsin    C-term to F/W/Y/L/M
8             CNBr            C-term to M

Table 1: Cleavage reagents modelled by MOWSE.
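
The rules in Table 1, together with the proline and Glu-C exceptions given
in section 2.2, could be applied to a sequence roughly as follows. This is
a minimal sketch with hypothetical names, not the MOWSE implementation. It
assumes that the "adjacent residue" in the Glu-C rule means the residue
immediately following the cleavage site, which the text does not state
explicitly.

```python
# Cleavage rules transcribed from Table 1: (target residues, side of cut).
RULES = {
    "trypsin": ("KR", "C"),
    "lys-c": ("K", "C"),
    "arg-c": ("R", "C"),
    "asp-n": ("D", "N"),
    "v8-bicarb": ("E", "C"),
    "v8-phosph": ("ED", "C"),
    "chymotrypsin": ("FWYLM", "C"),
    "cnbr": ("M", "C"),
}
PROLINE_EXEMPT = {"cnbr", "asp-n"}   # proline rule does not apply to these
GLU_C = {"v8-bicarb", "v8-phosph"}   # S. aureus V8 protease variants


def digest(sequence, reagent):
    """Return the fragments of a complete digest, in sequence order."""
    reagent = reagent.lower()
    residues, side = RULES[reagent]
    seq = sequence.upper()
    cuts = []  # a cut at index i falls between seq[i-1] and seq[i]
    for i, aa in enumerate(seq):
        if aa not in residues:
            continue
        cut = i if side == "N" else i + 1
        if cut <= 0 or cut >= len(seq):
            continue  # a cut at either end produces no new fragment
        following = seq[i + 1] if i + 1 < len(seq) else ""
        if reagent not in PROLINE_EXEMPT and following == "P":
            continue  # target residue followed by proline
        if reagent in GLU_C and following == "E":
            continue  # assumed reading of the Glu-C rule
        cuts.append(cut)
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


print(digest("AKRPGKA", "trypsin"))  # ['AK', 'RPGK', 'A']
```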
[3] Running database searches by e_mail:
********************************************************************
Search queries should be mailed to mowse at daresbury.ac.uk (short form
mowse at dl.ac.uk). Search results will be returned directly to your
e_mail address. Comments, please, to mbdpn at s-crim1.dl.ac.uk.
********************************************************************
The 'subject' field of your email message is irrelevant - all
parameters must be specified in the body of the message. The relevant
syntax is given below. Some lines are compulsory, others are optional
(see the description of parameters section).
All text is case-insensitive, and MOWSE expects integer data. Non-exponential
floating point syntax is acceptable, but MOWSE will round the data to the
nearest integer. Whitespace is ignored in an intuitive way.
MOWSE recognises the following command lines, which are further
described below:
Begin
Reagent
Tolerance
SeqMW
Filter
Datastart
Dataend
The order of lines is irrelevant with the exception of 'begin' and the
'datastart/dataend' commands (see below).
If multiple instances of a command occur then only the FIRST instance
will be recognised.
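
To make these conventions concrete, the sketch below parses a message body
the way the rules above read: case-insensitive keywords, free whitespace,
numbers rounded to the nearest integer, everything before 'begin' ignored,
and only the first instance of a command kept. It is an illustration with
hypothetical function names, not the server's parser. In particular it
assumes that each line between 'datastart' and 'dataend' carries one
peptide mass, which this part of the help file does not spell out.

```python
VALUE_COMMANDS = ("reagent", "tolerance", "seqmw", "filter")


def to_int(text):
    """Round a non-exponential number to the nearest integer."""
    if "e" in text.lower():
        raise ValueError("exponential notation is not accepted: " + text)
    return int(float(text) + 0.5)


def parse_query(body):
    """Parse a message body into (params, masses)."""
    params, masses = {}, []
    started = in_data = data_done = False
    for line in body.splitlines():
        words = line.split()
        if not words:
            continue
        key = words[0].lower()
        if not started:
            started = key == "begin"  # ignore everything before 'begin'
            continue
        if in_data:
            if key == "dataend":
                in_data, data_done = False, True
            else:
                masses.append(to_int(words[0]))  # assumed: one mass per line
            continue
        if key == "datastart" and not data_done:
            in_data = True
        elif key in VALUE_COMMANDS and key not in params and len(words) > 1:
            value = words[1].lower()
            params[key] = value if key == "reagent" else to_int(value)
    if not started:
        raise ValueError("query has no 'begin' line")
    if "reagent" not in params:
        raise ValueError("query has no 'reagent' line")
    return params, masses
```

A repeated command is skipped by the `key not in params` test, so the
first value given is the one that is kept.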
Begin
Every search query MUST start with a 'begin' line. There should only
be one 'begin' line and all other commands & data should immediately
follow.
Reagent
Every search query MUST specify a 'reagent' line. The word 'reagent'
must be followed by one of the supported cleavage reagents. These are:
Trypsin
Lys-C
Arg-C
Asp-N
V8-bicarb
V8-phosph
Chymotrypsin
CNBr
A typical reagent line is therefore of the form:
reagent trypsin
Tolerance
This line is optional. The supplied number specifies the error
allowed in the experimentally determined masses. If no
figure is specified, a default tolerance of 2 Daltons will
be assumed. If you wish to specify a different tolerance then follow
the word 'tolerance' with the required number of Daltons e.g.
tolerance 1
In this case, supplied peptide masses will be matched to +/- 1
Dalton. Values of 2-4 are suggested for data obtained by
laser-desorption TOF instruments. Accuracies of +/- 2 Daltons or better are
generally only possible using an appropriate internal standard (e.g.
oxidised insulin B chain) with TOF instruments.
For electrospray or FAB data, a value of 1 can be selected in most
cases. If you have real confidence in mass determination, specify '0'
(zero) to limit matches to the nearest integer value (effectively +/- 0.5
Daltons). Discrimination is significantly improved by the selection of a
small error tolerance.
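
One plausible reading of the tolerance rule is sketched below: the
experimental mass is rounded to the nearest integer and compared with the
integer fragment masses held in the database. The function name and the
exact comparison are assumptions made for illustration. The default of 2
Daltons is the one stated above.

```python
def matching_fragments(observed, fragment_masses, tolerance=2):
    """Return the database masses within +/- tolerance of an observed mass."""
    target = int(observed + 0.5)  # database masses are stored as integers
    return [m for m in fragment_masses if abs(m - target) <= tolerance]


candidates = [1231, 1232, 1234, 1236, 1237]
print(matching_fragments(1234.4, candidates))               # [1232, 1234, 1236]
print(matching_fragments(1234.4, candidates, tolerance=0))  # [1234]
```

With a tolerance of 0 only the nearest integer matches, which corresponds
to the effective +/- 0.5 Dalton window described above.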
SeqMW
This optional line allows you to give the molwt of the whole protein (if
known). This allows you to limit the search to proteins of this molwt
plus/minus a 'limit' (see below).
If unspecified, a whole protein molwt of 0 is assumed which MOWSE
interprets as "search the whole database". This will include all proteins
up to the maximum size of just under 700,000 Daltons.
You can specify any molwt in Daltons with this command e.g.
SeqMW 90000
Filter
This optional line is used in conjunction with the SeqMW command and
is meaningless without it. It specifies a percentage. Only proteins
of the given SeqMW +/- this percentage will be searched. If a SeqMW
is specified but Filter is unspecified then Filter will default to
25%. To specify a percentage of 30% use:
Filter 30
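
The arithmetic behind SeqMW and Filter can be written out as a short
sketch. The function name is hypothetical. The defaults (a SeqMW of 0
meaning the whole database, and a Filter of 25) are those given in the
text.

```python
def mass_window(seqmw=0, filter_percent=25):
    """Return (low, high) in Daltons, or None to search the whole database."""
    if seqmw == 0:
        return None
    delta = seqmw * filter_percent / 100
    return (seqmw - delta, seqmw + delta)


print(mass_window(90000, 30))  # (63000.0, 117000.0)
```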
In this case, a molecular weight of 90,000 Daltons was
specified and the selection of 30 for the filter restricts the
search to those proteins with masses from 63,000 to 117,000
Daltons. A value of 25 is suggested for initial searches, which
can be progressively widened for subsequent search attempts if no
match