Molecular Weight search software wanted
Alan Bleasby
ajb at s-crim1.dl.ac.uk
Fri Jul 2 19:09:07 EST 1993
Reinhard asked a question about molecular weight searching.
Such software is available on the UK EMBnet node and will be made
available via an email server in the next few days. I append a
software description below. Details of the software can be found in the
last issue of Current Biology.
Rgds
Alan Bleasby
SERC Daresbury Laboratory
Warrington WA4 4AD
UK
Daresbury is the UK EMBnet national node.
MOWSE MANUAL
Version 1.0
D.J.C.Pappin and A.J.Bleasby
[1] Introduction:
[2] Construction of the MOWSE database.
[2.1] Source database.
[2.2] Calculation of Molecular weight fragments.
[2.3] MOWSE database structure.
[2.4] The MW primary fragment molecular weight file.
[2.5] The MDX file OWL entry index.
[2.6] The SMW whole sequence molecular weight file.
[2.7] Program Requirements.
[2.8] MOWSE Scoring scheme.
[2.9] Simulation studies.
[3] References.
[4] Running database searches.
[4.1] Mw data file.
[4.2] Running the program.
[5] Results listing.
[5.1] Specified search parameters.
[5.2] Short 'hit' listing.
[5.3] Detailed 'hit' listing.
[5.4] Example of output listing.
[1] Introduction:
Determination of molecular weight has always been an important aspect
of the characterization of biological molecules. Protein molecular
weight data, historically obtained by SDS gel electrophoresis or gel
permeation chromatography, has been used to establish purity, detect
post-translational modification (such as phosphorylation or
glycosylation) and aid identification. Until just over a decade ago,
mass spectrometric techniques were limited to relatively small
biomolecules, as proteins and nucleic acids were too large and fragile
to withstand the harsh physical processes required to induce
ionization. This began to change with the development of 'soft'
ionization methods such as fast atom bombardment (FAB)[1],
electrospray ionisation (ESI) [2,3] and matrix-assisted laser
desorption ionisation (MALDI)[4], which can effect the efficient
transition of large macromolecules from solution or solid crystalline
state into intact, naked molecular ions in the gas phase. As an added
bonus to the protein chemist, sample handling requirements are minimal
and the amounts required for MS analysis are in the same range as, or
less than, those needed by existing analytical methods.
As well as providing accurate mass information for intact proteins,
such techniques have been routinely used to produce accurate peptide
molecular weight 'fingerprint' maps following digestion of known
proteins with specific proteases. Such maps have been used to confirm
protein sequences (allowing the detection of errors of translation,
mutation or insertion), characterise post-translational modifications
or processing events and assign disulphide bonds [5,6].
Less well appreciated, however, is the extent to which such peptide
mass information can provide a 'fingerprint' signature sufficiently
discriminating to allow for the unique and rapid identification of
unknown sample proteins, independent of other analytical methods such
as protein sequence analysis.
The following text describes the construction and development of the
MOWSE peptide mass database (for MOlecular Weight SEarch) at the SERC
Daresbury Laboratory. Practical experience has shown that sample
proteins can be uniquely identified using as few as 3-4
experimentally determined peptide masses when screened against a
fragment database derived from over 50,000 proteins. Experimental
errors of a few Daltons are tolerated by the scoring algorithms,
permitting the use of inexpensive time-of-flight mass spectrometers.
As with other types of physical data, such as amino acid composition
or linear sequence, peptide masses can clearly provide a set of
determinants sufficiently unique to identify or match unknown sample
proteins. Peptide mass fingerprints can prove as discriminating as
linear peptide sequence, but can be obtained in a fraction of the time
using less material. In many cases, this allows for a rapid
identification of a sample protein before committing to protein
sequence analysis. Fragment masses also provide structural
information, at the protein level, fully complementary to large-scale
DNA sequencing or mapping projects [7,8,9].
[2] Construction of the MOWSE database.
[2.1] Source database.
MOWSE was created from the OWL non-redundant composite protein
sequence database [10,11]. The latest release (version 18.1) contains
51,093 protein entries (comprising some 15,956,287 residues), derived
from:
Source                        Entries    Residues
SWISSPROT Rel 22                25044     8375696
NBRF Rel 33 (PIR 1)               942      374576
NBRF Rel 33 (PIR 2)              4660     1122905
NBRF Rel 33 (PIR 3)              7688     2200541
GenBank Rel 72                   8200     2601590
NRL_3D Rel 9 (June 1992)         1352      224520
[2.2] Calculation of Molecular weight fragments.
For each entry in the source OWL database, MOWSE derives both whole
sequence molecular weight and calculated peptide molecular weights for
complete digests using the range of cleavage reagents and rules
detailed in Table 1. Cleavage is disallowed if the target residue is
followed by proline (except for CNBr or Asp N). Glu C (S. aureus V8
protease) cleavages are also inhibited if the adjacent residue is
glutamic acid. Peptide mass calculations are based entirely on the
linear sequence and use the average isotopic masses of amide-bonded
amino acid residues (IUPAC 1987 relative atomic masses). To allow for
N-terminal hydrogen and C-terminal hydroxyl the final calculated
molecular weight of a peptide of N residues is given by the equation:
    \mathrm{MW} = \sum_{n=1}^{N} (\text{residue mass})_n + 18.0153
Molecular weights are rounded to the nearest integer value before
being entered into the database. Cysteine residues are calculated as
the free thiol, anticipating that samples are reduced prior to mass
analysis. CNBr fragments are calculated as the homoserine lactone
form. Information relating to post-translational modification
(phosphorylation, glycosylation etc.) is not incorporated into
calculation of peptide masses.
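The calculation above can be sketched in a few lines of Python. This is an
illustration only, not the MOWSE source: the function names are hypothetical,
and the residue masses are commonly tabulated average values that may differ
in the last decimal place from the IUPAC 1987 figures used by MOWSE. The
homoserine lactone adjustment for CNBr fragments is not modelled.

```python
import math

# Average masses of amide-bonded residues (assumed values, see note above).
# Cysteine is the free thiol form.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}

# N-terminal hydrogen plus C-terminal hydroxyl (one water), from the text.
TERMINAL_MASS = 18.0153


def peptide_mass(peptide):
    """Sum of residue masses plus the terminal H and OH."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + TERMINAL_MASS


def rounded_mass(peptide):
    """Nearest-integer mass; round-half-up is an assumption of this sketch."""
    return math.floor(peptide_mass(peptide) + 0.5)


if __name__ == "__main__":
    print(rounded_mass("GG"))
```

For the dipeptide GG this gives 2 x 57.0519 + 18.0153 = 132.1191, which is
stored as 132.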
Reagent no.  Reagent              Cleavage rule        Total peptides
     1       Trypsin              C-term to K/R               1711729
     2       Lys C                C-term to K                  922337
     3       Arg C                C-term to R                  835392
     4       Asp N                N-term to D                  835002
     5       Glu C (Bicarbonate)  C-term to E                  915708
     6       Glu C (Phosphate)    C-term to E/D               1793285
     7       Chymotrypsin         C-term to F/W/Y/L/M         3047947
     8       CNBr                 C-term to M                  392924

Table 1: Cleavage reagents modelled by MOWSE.
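A complete digest under the rules of Table 1 might be sketched as follows.
The rule table layout and the function name are hypothetical, and only a
subset of the reagents is shown. The proline rule is applied as described in
section [2.2]; the Glu C adjacency rule is deliberately left out, since the
text does not pin down which neighbouring residue is meant.

```python
# reagent -> (target residues, side of the target that is cut,
#             whether a following proline blocks cleavage)
RULES = {
    "Trypsin": ("KR", "C", True),
    "Lys C": ("K", "C", True),
    "Arg C": ("R", "C", True),
    "Asp N": ("D", "N", False),
    "CNBr": ("M", "C", False),
}


def digest(sequence, reagent):
    """Return the peptides of a complete digest, in sequence order."""
    targets, side, proline_blocks = RULES[reagent]
    cut_points = []
    for i, residue in enumerate(sequence):
        if residue not in targets:
            continue
        if side == "C":
            cut = i + 1                      # cut after the target residue
            if cut >= len(sequence):
                continue                     # target is the last residue
            if proline_blocks and sequence[cut] == "P":
                continue                     # followed by proline: no cut
        else:
            cut = i                          # cut before the target residue
            if cut == 0:
                continue                     # target is the first residue
        cut_points.append(cut)
    bounds = [0] + cut_points + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    print(digest("AKRPGKDE", "Trypsin"))
    print(digest("AKDPGDE", "Asp N"))
```

In the first call the R-P bond is left intact, giving AK, RPGK and DE; in the
second, Asp N cuts before each D regardless of the following proline, giving
AK, DPG and DE.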
[2.3] MOWSE database structure.
The database consists of three binary files:
i) MOWSE.MW The primary file containing the fragment
molecular weights.
ii) MOWSE.MDX Index file relating OWL identifier codes
to the molecular weight information
in the primary Mw file.
iii) MOWSE.SMW Calculated molecular weights of intact OWL
sequences.
The query program accesses the binary information transparently from
the user viewpoint. In the internal representation the molecular
weight (and other) integers are stored as 4-byte machine-specific
quantities. The binary files can be transferred between machines of
the same 'endian' nature, but 'cross-endian' transfer is not possible.
The MOWSE software allows recreation of the files on any platform
supporting a standard C language compiler. The organisation of the
database files is described below.
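The 'endian' restriction can be illustrated with Python's standard struct
module. This is a sketch only: the value 1500 is a made-up fragment mass, and
treating the stored integers as signed is an assumption, since the manual
says only that they are 4-byte quantities.

```python
import struct

mass = 1500  # hypothetical integer fragment mass

# The same value written as a 4-byte integer on two kinds of machine.
little = struct.pack("<i", mass)   # little-endian byte order
big = struct.pack(">i", mass)      # big-endian byte order

# The two files hold the same bytes in opposite order.
assert little == big[::-1]

# Reading little-endian bytes with big-endian rules yields a wrong value,
# which is why the binary files cannot be moved 'cross-endian'.
misread = struct.unpack(">i", little)[0]
correct = struct.unpack("<i", little)[0]

if __name__ == "__main__":
    print(correct, misread)
```

Rebuilding the files on the target platform, as the text describes, avoids
the problem because the integers are then written in that machine's own byte
order.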
[2.4] The MW primary fragment molecular weight file.
Fragment molecular weight entries in this file map sequentially to the
order of entries within the source (OWL) protein sequence file. Each
MW file entry consists of 4 blocks, shown below. The MW entries are
concatenated.
Block 1 OWL Entry Code 20 bytes
Block 2 OWL Title Line 80 bytes
Block 3 Reagent Table 80 bytes
Block 4 Reagent 1 4 byte
Reag