
Periodicity of sequence lengths in protein databases

Eugene Demchuk demchuk at embl-heidelberg.de
Tue Jun 9 15:41:49 EST 1992



> We seem to remember a paper published on the periodicity of the lengths of
> sequences in a protein sequence database. Unfortunately no one here can
> remember the authors, journal or title of the paper.  Does any one out there
> know of this work, if so could you post or send the reference.
> 
> As we recall, the paper showed a plot of number of protein sequences as a
> function of sequence length.  The plot was not smooth, but showed a
> periodicity. The periodicity suggested that proteins were built from modules
> of x amino acids.  We are interested in finding out if our vague memories are
> correct, what the reference is, and what the periodicity x is.
> 


The periodicity is 5. It was discovered in our work but, you are right, it 
has not been well published in English. The main results can be found in: 

   E.J.Demchuk, N.G.Esipova, V.G.Tumanyan (1991). Length regularities of 
   genetic texts. Proceedings of the First International Conference on 
   Electrophoresis, Supercomputing and the Human Genome. April 10-13 1990.  
   Held at Florida State University, Tallahassee, Florida. C.R. Cantor, 
   H.A. Lim (eds), World Scientific, Singapore, pp.279-285 

and in two preliminary publications:

   E.J.Demchuk, N.G.Esipova, V.G.Tumanyan (1988). Existence of correlation 
   between lengths of polypeptide chains of proteins.  DOKLADY AKADEMII 
   NAUK SSSR (Russ.) vol.303, iss.5, pp.1262-1264 

   E.J.Demchuk, N.G.Esipova, V.G.Tumanyan (1989). Regularities in 
   arrangement and evolution of protein primary structures connected with 
   deletions-insertions found upon statistical analysis of data banks. 
   STUDIA BIOPHYSICA vol.129, iss.2-3, pp.193-199 

The periodicity is the dominant one for proteins, but it is nevertheless 
weak. So you should be careful about interpreting it as "modules of x amino 
acids" (compare it with the results for the exon sample, which definitely 
has a modular structure and so is especially suited to exon shuffling). The 
periodicity seems to originate from alpha-helices (look at the signal 
peptide sample).  
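
A crude way to look for such a period in a list of sequence lengths is to 
count how many lengths fall into each residue class modulo the suspected 
period and compare the counts with a uniform expectation. The sketch below 
is only an illustration under that assumption; it is not the procedure used 
in the papers above, the function names are hypothetical, and the toy 
lengths are invented. A real length distribution is not flat, so a careful 
analysis would first remove the smooth trend. 

```python
from collections import Counter


def residue_class_counts(lengths, period=5):
    """Count how many sequence lengths fall in each class modulo `period`."""
    counts = Counter(n % period for n in lengths)
    return [counts.get(r, 0) for r in range(period)]


def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected counts per class."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Invented toy lengths, NOT taken from any database.
toy_lengths = [100, 105, 110, 112, 115, 120, 125, 131, 140, 150]
counts = residue_class_counts(toy_lengths, period=5)
stat = chi_square_uniform(counts)
print(counts, stat)
```

A large statistic only says that the classes are unevenly populated; a weak 
periodicity, as described above, needs a large sample to show up at all. 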

In the second paper you can also find a partial answer to your previous 
query: 


> I have a question regarding deletions in sequence alignments.  Most alignment
> programs use the formula ak+b to score an alignment of length k with b as the
> fixed gap penalty and a as the incremental gap penalty.  What are the default
> values these parameters take in the common commercial alignment packages.  Is
> there a consensus about what these values should be or do people simply use
> what works.
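
For reference, the ak+b scheme in the question charges a fixed cost b for 
opening a gap plus a cost a for each of the k residues in it. A minimal 
sketch follows; the function name and the values a=1, b=10 are illustrative 
assumptions, not the defaults of any package. 

```python
def affine_gap_penalty(k, a, b):
    """Penalty a*k + b for a single gap of k residues (k >= 1)."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return a * k + b


# Illustrative values only: a = 1 per residue, b = 10 per gap.
one_long_gap = affine_gap_penalty(3, a=1, b=10)
three_short_gaps = 3 * affine_gap_penalty(1, a=1, b=10)
print(one_long_gap, three_short_gaps)
```

With b larger than a, one gap of three residues costs less than three 
separate single-residue gaps, which is the purpose of the fixed term. 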


Together with Vladimir Tumanyan, we carried out work that is in effect a 
generalization of the Dayhoff approach to indels: collect statistics for 
confident mutations in closely related sequences, formulate a law, and try 
to use it to study distantly related sequences. The last paper is also 
concerned with the first two tasks. Another related publication is: 

   E.J.Demchuk, V.G.Tumanyan (1987). Statistical regularities of deletions-
   insertions in proteins. DOKLADY AKADEMII NAUK SSSR (Russ.) vol.296, 
   iss.6, pp.1488-1491 
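
The first step of that program, collecting indel statistics from alignments 
of closely related sequences, can be sketched as counting gap runs by 
length. This is only an illustration: it assumes the alignments are already 
given as pairs of equal-length gapped strings with "-" as the gap 
character, the function name is hypothetical, the toy alignments are 
invented, and it is not the exact procedure of the papers cited here. 

```python
import re
from collections import Counter


def gap_length_counts(aligned_pairs, gap_char="-"):
    """Count gap runs by length over both rows of each pairwise alignment."""
    run = re.compile(re.escape(gap_char) + "+")
    counts = Counter()
    for row_a, row_b in aligned_pairs:
        if len(row_a) != len(row_b):
            raise ValueError("aligned rows must have equal length")
        for row in (row_a, row_b):
            for match in run.finditer(row):
                counts[len(match.group())] += 1
    return counts


# Invented toy alignments, not real protein data.
toy_pairs = [
    ("ACD--FGH", "ACDEEFGH"),
    ("MK-LV", "MKALV"),
    ("PQRST", "P---T"),
]
counts = gap_length_counts(toy_pairs)
print(sorted(counts.items()))
```

The resulting frequencies of gap lengths are the raw material from which a 
length-dependent indel weight could then be fitted. 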

I did some work to assess how useful the suggested indel weights are. It 
was reported at a conference about two years ago and should appear in its 
proceedings in the near future: 

   E.J.Demchuk, V.G.Tumanyan - Biological adequacy of protein sequence 
   alignments. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE "Modelling and 
   Computer Methods in Molecular Biology and Genetics" held at Novosibirsk 
   in August 1990 (in press, 1992).  

I would be very grateful if you would let me know your results if you try 
to follow the recommendations published in the papers above. In any case, 
you should be careful in implementing them. They should work only within a 
probabilistic scoring scheme (like Dayhoff's) that seeks the maximum 
likelihood alignment, and they are valid only for proteins. That is, you 
should take care over which scoring matrix you use. Neglecting this may 
lead to totally wrong results that are difficult to check (see the paper by 
Barton and Sternberg in Prot. Engineering 1986 as an example). You should 
also remember that the proposed approach is a statistical one. It will 
definitely produce better results when applied systematically, but in any 
particular case it may not be the perfect choice.  


Good luck,
Eugene Demchuk.

 --
Eugene Demchuk	 Demchuk at EMBL-Heidelberg.de	EMBL
						Postfach 10.2209
						Meyerhofstrasse 1
						6900 Heidelberg FRG

	Fax:	(+49-6221) 387-517;	(+49-6221) 387-306 
	Phone:	(+49-6221) 387-553;	(+49-6221) 387-0 (via exchange)   
	Telex:	461613 (embl d)


