Molecular Weight search software wanted

Alan Bleasby ajb at
Fri Jul 2 19:09:07 EST 1993

Reinhard asked a question about molecular weight searching.
Such software is available on the UK EMBnet node and will be made
available via an email server in the next few days. I append a
software description below. Details of the software can be found in the
last issue of Current Biology.


Alan Bleasby
SERC Daresbury Laboratory
Warrington WA4 4AD

Daresbury is the UK EMBnet national node.


				Version 1.0
			D.J.C.Pappin and A.J.Bleasby

	[1] Introduction:

	[2] Construction of the MOWSE database.

		[2.1] Source database.
		[2.2] Calculation of Molecular weight fragments.
		[2.3] MOWSE database structure.
		[2.4] The MW primary fragment molecular weight file.
		[2.5] The MDX file OWL entry index.
		[2.6] The SMW whole sequence molecular weight file.
		[2.7] Program Requirements.
		[2.8] MOWSE Scoring scheme.
		[2.9] Simulation studies.

	[3] References.

	[4] Running database searches.

		[4.1] Mw data file.
		[4.2] Running the program.

	[5] Results listing.

		[5.1] Specified search parameters.
		[5.2] Short 'hit' listing.
		[5.3] Detailed 'hit' listing.
		[5.4] Example of output listing.

[1] Introduction:

Determination of molecular weight has always been an  important aspect
of the characterization  of  biological  molecules.  Protein molecular
weight data, historically  obtained by SDS  gel electrophoresis or gel
permeation chromatography,  has  been used to establish purity, detect
post-translational    modification   (such   as   phosphorylation   or
glycosylation) and  aid identification. Until just over a decade  ago,
mass  spectrometric  techniques  were  limited  to   relatively  small
biomolecules, as proteins and nucleic acids were too large and fragile
to  withstand  the  harsh  physical   processes  required  to   induce
ionization.   This  began  to change with the  development  of  'soft'
ionization   methods  such  as   fast   atom   bombardment   (FAB)[1],
electrospray   ionisation  (ESI)   [2,3]  and  matrix-assisted   laser
desorption  ionisation  (MALDI)[4],  which  can  effect  the efficient
transition of large macromolecules from solution or  solid crystalline
state into  intact, naked molecular ions in the gas phase. As an added
bonus to the protein chemist, sample handling requirements are minimal
and the amounts required for MS analysis are in the same range as, or
less than, those required by existing analytical methods.

As well as providing  accurate mass information  for intact  proteins,
such techniques have been  routinely used  to produce accurate peptide
molecular  weight  'fingerprint'  maps  following  digestion of  known
proteins with specific proteases.  Such maps have been used to confirm
protein  sequences (allowing the  detection of errors  of translation,
mutation or insertion), characterise post-translational  modifications
or processing events and assign disulphide bonds [5,6].

Less well appreciated, however, is  the  extent to which  such peptide
mass information  can  provide a 'fingerprint' signature  sufficiently
discriminating to  allow  for the unique and rapid  identification  of
unknown sample proteins, independent of  other analytical methods such
as protein sequence analysis.

The following  text describes the construction and development of  the
MOWSE peptide mass database (for MOlecular Weight SEarch)  at the SERC
Daresbury  Laboratory.  Practical experience  has  shown  that  sample
proteins  can  be  uniquely  identified  using  as  few  as  3-4
experimentally  determined  peptide  masses when  screened  against  a
fragment database  derived  from  over  50,000 proteins.  Experimental
errors  of a  few  Daltons  are tolerated by  the  scoring algorithms,
permitting the use of inexpensive  time-of-flight  mass spectrometers.
As  with other  types of physical data, such as amino acid composition
or linear  sequence,  peptide masses  can clearly  provide  a  set  of
determinants sufficiently  unique to  identify or match unknown sample
proteins. Peptide mass  fingerprints  can prove  as discriminating  as
linear peptide sequence, but can be obtained in a fraction of the time
using  less  material.   In  many  cases,  this  allows  for  a  rapid
identification  of a  sample  protein  before  committing  to  protein
sequence   analysis.   Fragment   masses   also   provide   structural
information, at  the protein level, fully complementary to large-scale
DNA sequencing or mapping projects [7,8,9].

[2] Construction of the MOWSE database.

[2.1] Source database.

MOWSE  was  created  from  the  OWL  non-redundant  composite  protein
sequence  database [10,11]. The latest release (version 18.1) contains
51,093 protein entries (comprising  some 15,956,287 residues), derived
from the following source databases:

Database                        Entries      Residues
SWISSPROT Rel 22                  25044       8375696
NBRF Rel 33 (PIR 1)                 942        374576
NBRF Rel 33 (PIR 2)                4660       1122905
NBRF Rel 33 (PIR 3)                7688       2200541
GenBank Rel 72                     8200       2601590
NRL_3D Rel 9 (June 1992)           1352        224520

[2.2] Calculation of Molecular weight fragments.

For each  entry in  the source OWL database, MOWSE derives both  whole
sequence molecular weight and calculated peptide molecular weights for
complete  digests using the  range  of  cleavage  reagents  and  rules
detailed in Table 1. Cleavage  is disallowed if the target residue  is
followed by proline (except for CNBr  or Asp N).   Glu C (S. aureus V8
protease) cleavages  are also  inhibited  if  the adjacent  residue is
glutamic acid.  Peptide mass calculations  are  based  entirely on the
linear sequence  and use the average isotopic  masses  of amide-bonded
amino acid residues (IUPAC 1987 relative atomic  masses). To allow for
N-terminal  hydrogen and  C-terminal  hydroxyl  the  final  calculated
molecular weight of a peptide of N residues is given by the equation:

               Σ residue mass + 18.0153

Molecular weights  are rounded  to the nearest  integer  value  before
being entered into  the database. Cysteine residues are  calculated as
the free thiol, anticipating that  samples are  reduced  prior to mass
analysis.  CNBr  fragments  are calculated as  the homoserine  lactone
form.  Information  relating  to  post-translational  modification
(phosphorylation,  glycosylation  etc.)  is  not incorporated into the
calculation of peptide masses.
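
The mass calculation described above can be sketched in a few lines of
Python. This is an illustration only, not the MOWSE source. The residue
masses are commonly tabulated average values and are assumed, not
confirmed, to match the table MOWSE uses; the names RESIDUE_MASS and
peptide_mass are hypothetical. Cysteine is taken as the free thiol, and
the CNBr homoserine lactone correction is left out.

```python
# Average residue masses (Da) of amide-bonded amino acids.
# Commonly tabulated values; assumed (not confirmed) to match MOWSE's table.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}

# N-terminal hydrogen plus C-terminal hydroxyl, as given in the text.
WATER = 18.0153


def peptide_mass(peptide):
    """Return the average mass of a linear peptide, rounded to an integer."""
    total = sum(RESIDUE_MASS[residue] for residue in peptide) + WATER
    # Round half up to the nearest integer, as stored in the database.
    return int(total + 0.5)
```

For the tripeptide GAK the sum of residue masses is 256.3048, so
peptide_mass("GAK") returns 274 once 18.0153 is added and the result is
rounded.
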
   Reagent no.	Reagent			Cleavage rule	    Total peptides 

	1	Trypsin			C-term to K/R		1711729
	2	Lys C			C-term to K		922337
	3	Arg C			C-term to R		835392
	4	Asp N			N-term to D		835002
	5	Glu C (Bicarbonate)	C-term to E		915708
	6	   "     (Phosphate)	C-term to E/D		1793285
	7	Chymotrypsin		C-term to F/W/Y/L/M	3047947
	8	CNBr			C-term to M		392924

	Table 1: Cleavage reagents modelled by MOWSE.
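
As an illustration of how the rules in Table 1 might be applied, the
following Python sketch performs a complete digest for five of the
reagents. It is not the MOWSE implementation: the RULES layout and the
digest function are hypothetical names, and the two Glu C variants and
chymotrypsin are omitted for brevity. The proline rule is switched off
for CNBr and Asp N, as stated in the text.

```python
# Rules transcribed from Table 1. The tuple layout is an illustrative choice:
# (target residues, side of the target that is cut, blocked by following Pro)
RULES = {
    "Trypsin": ("KR", "C", True),
    "Lys C": ("K", "C", True),
    "Arg C": ("R", "C", True),
    "Asp N": ("D", "N", False),  # exempt from the proline rule
    "CNBr": ("M", "C", False),   # exempt from the proline rule
}


def digest(sequence, reagent):
    """Return the peptides from a complete digest of sequence."""
    targets, side, proline_blocks = RULES[reagent]
    cut_points = []
    for i, residue in enumerate(sequence):
        if residue not in targets:
            continue
        followed_by_pro = i + 1 < len(sequence) and sequence[i + 1] == "P"
        if proline_blocks and followed_by_pro:
            continue  # cleavage disallowed before proline
        cut = i + 1 if side == "C" else i
        if 0 < cut < len(sequence):  # ignore cuts at the sequence ends
            cut_points.append(cut)
    bounds = [0] + cut_points + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]
```

With the made-up sequence MKAARPGGKDE, digest(sequence, "Trypsin")
returns MK, AARPGGK and DE: the arginine is not cleaved because it is
followed by proline.
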

[2.3] MOWSE database structure.

The database consists of three binary files:

i) MOWSE.MW 		The primary file containing the fragment 
			molecular weights.

ii) MOWSE.MDX 		Index file relating OWL identifier codes 
			to the molecular weight information 
			in the primary Mw file.

iii) MOWSE.SMW		Calculated molecular weights of intact OWL 
			sequences.

The  query program accesses the  binary information transparently from
the  user  viewpoint.  In  the internal  representation the  molecular
weight  (and  other) integers are stored  as 4-byte  machine-specific
quantities. The  binary files  can be transferred  between machines of
the same 'endian' nature, but 'cross-endian' transfer is not possible.
The MOWSE  software  allows recreation  of  the  files on any platform
supporting a  standard  C language compiler.  The organisation of  the
database files is described below.
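
The 'endian' restriction can be seen with a small Python sketch using
the standard struct module. The value 1234 and the helper read_mw are
hypothetical, and a signed 4-byte integer is assumed; the text does not
say whether MOWSE stores signed or unsigned quantities.

```python
import struct

value = 1234  # hypothetical fragment molecular weight

# The same integer written as 4 bytes in each byte order.
little = struct.pack("<i", value)  # little-endian layout
big = struct.pack(">i", value)     # big-endian layout


def read_mw(raw, byte_order):
    """Decode 4 raw bytes as a signed integer in the given byte order."""
    return struct.unpack(byte_order + "i", raw)[0]
```

Reading the bytes back with the order they were written in recovers
1234, whereas read_mw(little, ">") yields an unrelated number. This is
why the files are recreated on the target platform rather than copied
across.
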

[2.4] The MW primary fragment molecular weight file.

Fragment molecular weight entries in this file map sequentially to the
order  of entries within the source (OWL) protein sequence file.  Each
MW file entry consists of 4 blocks, shown below. The MW entries
are concatenated.

	Block 1	OWL Entry Code	20 bytes
	Block 2	OWL Title Line	80 bytes
	Block 3	Reagent Table	80 bytes
	Block 4	Reagent 1	4 byte
