Computer assisted amino acid analysis

Uwe Hobohm hobohm at embl-heidelberg.de
Thu Apr 28 09:35:06 EST 1994


AMINO ACID ANALYSIS AND PROTEIN DATABASE COMPOSITION SEARCH
  AS A FAST AND INEXPENSIVE METHOD TO IDENTIFY PROTEINS -

ANNOUNCEMENT OF A FREE COMPUTER ANALYSIS SERVICE FOR YOUR
                     AAA-DATA


AIM
We have recently developed a method to identify proteins
from amino acid analysis (AAA) data. The method has been
submitted for publication. We are interested in analysing
AAA data from different labs in order to explore the limits
of the method further.
If enough labs participate, the results may be published
as a survey at a later stage.


ABSTRACT
Computer assisted amino acid analysis (AAA) can be used to
identify minute amounts of protein samples, e.g. from 2D-
gel spots. Compared to protein sequencing, AAA is much 
cheaper and faster and allows higher sample throughput.
Automatic AAA systems perform amino acid hydrolysis,
derivatization and subsequent separation by HPLC almost
without manual intervention.

Thus, the method may replace protein sequencing as a first
identification attempt provided a homolog can be found in
the database.


PROPSEARCH DATABASE SEARCH METHOD
A computer program (PROPSEARCH) has been developed to 
analyse AAA data. Briefly, it works as follows:
For each protein in the Swissprot database, the content of 16
amino acids (D+N,E+Q,S,G,H,R,T,A,P,Y,V,M,I,L,F,K) is
calculated from the sequence. Asparagine and glutamine
are converted to their corresponding acids during hydrolysis
and cannot, therefore, be quantified separately. Tryptophan
and cysteine cannot be measured reliably without extra
effort and are not considered. These 16 numbers,
expressed in percent of sequence length, are used as the
amino acid composition. Values are normalized to add up to
100 percent, then normalized again by the standard deviation
(SD) of the respective database average. The molecular weight
is used as the 17th data point; the pI may be used as the 18th.
A database search is performed by comparing
the experimentally determined amino acid composition and MW
with compositions calculated from the database sequences.
For each comparison between experimental data and a 
protein from the database a distance function is applied,
using amino acid specific weights, to calculate a distance.
Distances are rank ordered and the smallest 50 distances
are reported. The protein at the top of the list has the
highest probability of being identical to the search
protein.
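
The search described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions,
not the PROPSEARCH code: the weighted Euclidean distance, the
function names and the caller-supplied SD and weight vectors are
hypothetical, since the actual distance function and weights are
not given here. MW and pI are left out for brevity.

```python
from collections import Counter
import math

# The 16 measured categories from the text; D stands for D+N, E for E+Q.
CATEGORIES = ["D", "E", "S", "G", "H", "R", "T", "A",
              "P", "Y", "V", "M", "I", "L", "F", "K"]
MERGE = {"N": "D", "Q": "E"}  # amides are read as their acids after hydrolysis


def composition(sequence):
    """Percent composition over the 16 categories, summing to 100."""
    counts = Counter(MERGE.get(aa, aa) for aa in sequence.upper())
    total = sum(counts[c] for c in CATEGORIES)  # W, C and unknowns are ignored
    if total == 0:
        raise ValueError("sequence contains none of the 16 measured residues")
    return [100.0 * counts[c] / total for c in CATEGORIES]


def distance(query, candidate, sd, weights):
    """ASSUMPTION: weighted Euclidean distance on SD-scaled differences."""
    return math.sqrt(sum(w * ((q - c) / s) ** 2
                         for q, c, s, w in zip(query, candidate, sd, weights)))


def search(query, database, sd, weights, top=50):
    """Return the `top` smallest (distance, name) pairs, closest first."""
    hits = [(distance(query, composition(seq), sd, weights), name)
            for name, seq in database.items()]
    return sorted(hits)[:top]
```

Here `database` is assumed to be a plain dict mapping a protein
name to its sequence; `sd` and `weights` are lists of 16 numbers.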


HOW MUCH EXPERIMENTAL ERROR IS TOLERATED?
AAA is not able to determine amino acid content with
arbitrary precision, due to different stability and
hydrolysis kinetics of different amino acid residue
types. But the method tolerates a certain experimental
error. How large may this experimental error be?
Assume AAA has determined an alanine content of 7.5%,
but the protein has a real Ala content of 8.5%
(for instance, the protein contains 17 Ala in a sequence
of length 200 = 8.5%). The error for Ala is then 1%.
Assume all other 15 amino acid types show an error of
1% as well, resulting in an overall experimental error
of 16%. Assume further that the molecular weight has been
determined from a 1D-gel with an error of 8% (e.g.
determined MW was 20.2 kD, real MW 22 kD = 8%).
An experimental error of this kind is well tolerated
by the method. We could in many cases identify the
correct protein with overall errors up to 25%.
On the other hand, many AAA runs are considerably more
accurate, especially if larger protein amounts were available,
with experimental errors of less than 5%.
The reliability as a function of experimental error
has been investigated in detail. We supply the
reliability table together with a PROPSEARCH
database search result.
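
The arithmetic of the example above can be checked with a short
Python sketch. It assumes, as we read the text, that the overall
error is the sum of the absolute per-residue differences in
percentage points, and that the MW error is relative to the real
MW. The function names are ours.

```python
def overall_error(measured, actual):
    """Sum of absolute per-residue differences, in percentage points."""
    return sum(abs(m - a) for m, a in zip(measured, actual))


def relative_error(measured, actual):
    """Deviation from the actual value, in percent of the actual value."""
    return 100.0 * abs(measured - actual) / actual


ala_actual = 100.0 * 17 / 200          # 17 Ala in 200 residues
ala_error = abs(7.5 - ala_actual)      # difference for Ala alone
mw_error = relative_error(20.2, 22.0)  # MW read from a 1D-gel

print(ala_actual, ala_error, round(mw_error, 1))
```

With 16 residue types each off by one percentage point,
`overall_error` gives 16, matching the figure in the text.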


RELIABILITY OF PROTEIN IDENTIFICATION IS A FUNCTION
OF PROPER EXPERIMENTAL DATA
The distance calculated for the top scoring protein serves
as a reliability measure: a PROPSEARCH distance below 1.0
indicates unambiguous identification of the protein AND
"good" experimental data, while higher distances indicate
lower reliability, caused either by "bad" experimental data
or by the absence of a homolog of the unknown protein from
the database. In the latter case the top scoring hit
has been picked up by chance.


FRAGMENT SEARCH
Often AAA is done on protein fragments rather than entire
proteins. PROPSEARCH can be used in those cases as well by
shifting a window over sequences in the database, i.e. by
"cutting" database protein sequences in pieces of similar
size. The approximate molecular weight of the fragment is
needed here to determine the window size.
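
A minimal sketch of the window idea in Python follows. The
average residue mass of 110 Da used to turn a fragment MW into
a window length is a common rule of thumb and an assumption
here, not a PROPSEARCH parameter; the function names and the
step size are likewise hypothetical.

```python
AVERAGE_RESIDUE_MASS = 110.0  # Da; ASSUMPTION, a rough rule of thumb


def window_length(fragment_mw):
    """Approximate number of residues in a fragment of the given MW (Da)."""
    return max(1, round(fragment_mw / AVERAGE_RESIDUE_MASS))


def windows(sequence, length, step=1):
    """Yield (start, piece) for each window shifted over the sequence."""
    if len(sequence) <= length:
        # Sequence no longer than the window: use it as a whole.
        yield 0, sequence
        return
    for start in range(0, len(sequence) - length + 1, step):
        yield start, sequence[start:start + length]
```

Each piece would then be scored like a full database sequence,
and the best-scoring window reported for that protein.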


FREE PROPSEARCH AAA-DATA SERVICE
Please feel free to send us your AAA data by email. A 
PROPSEARCH computer analysis will be performed and the 
results sent back to you. We are interested in AAA data 
measured by different methods (PITC manual, PITC 
automatic, ninhydrin, fluorescence) to explore the 
limits of the method. IN PARTICULAR, DATA FROM SMALL
SAMPLES (I.E. FROM 1D- OR 2D-GELS) OF A *KNOWN* PROTEIN 
ARE HIGHLY APPRECIATED.


Please send your data to:     hobohm at embl-heidelberg.de

in the following format (comment lines begin with '#'):

#begin                                                          (please fill in)
#Date                                                      : 
#From [email-address]                                      : 
#Protein [indicate if identity is known]                   : 
#AAA-method [PITC-manual,PITC-auto,ninhydrin,fluorescence] :
#Amount of sample [in micro-gram or micro-mol]             :
#
#AA          Residues/molecule [in percent sequence length or number of residues]
D               15         [Asx]
E               19         [Glx]
S               10         [Ser]
G               16         [Gly]
H               2          [His]
R               8          [Arg]
T               8          [Thr]
A               14         [Ala]
P               7          [Pro]
Y               4          [Tyr]
V               13         [Val]
M               1          [Met]
I               7          [Ile]
L               16         [Leu]
F               7          [Phe]
K               16         [Lys]
MW              18400
pI              8.5
#end

Data lines may be sent in any order and need no fixed
columns, i.e. any number of spaces may separate the fields.
The pI may be supplied but is not required.
Please begin any non-data-line with '#'.
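
For those preparing data by script, the format rules above can
be checked with a small Python sketch like the following. It is
our own illustration of the stated rules (comment lines start
with '#', free order, whitespace-separated fields, optional pI),
not the parser used by the service; the function name and the
returned structure are hypothetical.

```python
def parse_submission(text):
    """Split data lines into (residues, extras); '#' lines are skipped."""
    residues, extras = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank or comment line
        fields = line.split()  # any run of spaces or tabs separates fields
        if len(fields) < 2:
            raise ValueError(f"expected a key and a value: {line!r}")
        key, value = fields[0], float(fields[1])
        # Anything after the value, such as "[Asx]", is ignored.
        if key.lower() in ("mw", "pi"):
            extras[key.lower()] = value
        else:
            residues[key.upper()] = value
    return residues, extras
```

The first return value maps one-letter codes to the reported
amounts; the second holds "mw" and, if present, "pi".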


