Periodicity of sequence lengths in protein databases
demchuk at embl-heidelberg.de
Tue Jun 9 15:41:49 EST 1992
> We seem to remember a paper published on the periodicity of the lengths of
> sequences in a protein sequence database. Unfortunately no one here can
> remember the authors, journal or title of the paper. Does any one out there
> know of this work, if so could you post or send the reference.
> As we recall, the paper showed a plot of number of protein sequences as a
> function of sequence length. The plot was not smooth, but showed a
> periodicity. The periodicity suggested that proteins were built from modules
> of x amino acids. We are interested in finding out if our vague memories are
> correct, what the reference is, and what the periodicity x is.
The periodicity is 5. It was discovered in our work but, as you say, it is
not well published in English. The main results can be found in:
E.J.Demchuk, N.G.Esipova, V.G.Tumanyan (1991). Length regularities of
genetic texts. Proceedings of the First International Conference on
Electrophoresis, Supercomputing and the Human Genome. April 10-13 1990.
Held at Florida State University, Tallahassee, Florida. C.R. Cantor,
H.A. Lim (eds), World Scientific, Singapore, pp.279-285
and in 2 preliminary publications:
E.J.Demchuk, N.G.Esipova, V.G.Tumanyan (1988). Existence of correlation
between lengths of polypeptide chains of proteins. DOKLADY AKADEMII
NAUK SSSR (Russ.) vol.303, iss.5, pp.1262-1264
E.J.Demchuk, N.G.Esipova, V.G.Tumanyan (1989). Regularities in
arrangement and evolution of protein primary structures connected with
deletions-insertions found upon statistical analysis of data banks.
STUDIA BIOPHYSICA vol.129, iss.2-3, pp.193-199
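For anyone who wants to look for such a periodicity in their own database, here is a minimal sketch in Python. It is not the procedure used in the papers above. It assumes only a list of sequence lengths, removes the smooth trend of the length histogram with a moving average, and compares the Fourier power of what remains at a few candidate periods. All function names and the default parameters (max_len, half_window, the candidate range) are illustrative choices.

```python
import math

def length_counts(lengths, max_len):
    """counts[n] = number of sequences of length n, for 0 <= n <= max_len."""
    counts = [0] * (max_len + 1)
    for n in lengths:
        if 0 < n <= max_len:
            counts[n] += 1
    return counts

def detrend(counts, half_window=10):
    """Subtract a centred moving average so only short-range ripple remains."""
    residual = []
    for i in range(len(counts)):
        lo = max(0, i - half_window)
        hi = min(len(counts), i + half_window + 1)
        residual.append(counts[i] - sum(counts[lo:hi]) / (hi - lo))
    return residual

def spectral_power(values, period):
    """Squared magnitude of the Fourier component with the given period."""
    re = sum(v * math.cos(2 * math.pi * i / period) for i, v in enumerate(values))
    im = sum(v * math.sin(2 * math.pi * i / period) for i, v in enumerate(values))
    return (re * re + im * im) / len(values)

def strongest_period(lengths, max_len=1000, candidates=range(2, 21)):
    """Candidate period with the largest power in the detrended histogram."""
    residual = detrend(length_counts(lengths, max_len))
    return max(candidates, key=lambda p: spectral_power(residual, p))

# Synthetic example: every length from 50 to 499 occurs 3 times,
# and lengths divisible by 5 occur 5 times.
lengths = []
for n in range(50, 500):
    lengths += [n] * (5 if n % 5 == 0 else 3)
print(strongest_period(lengths, max_len=600))
```

On real data the effect is weak, so the power at the best period should be compared with that of shuffled or simulated length sets before anything is concluded.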
The periodicity is the dominant one for proteins, but it is nevertheless
weak. So you must be careful about explaining it as "modules of x amino
acids" (compare it with the results on the exon sample, which definitely has
a modular structure and so is especially suited to exon shuffling). The
periodicity seems to originate from alpha-helices (look at the signal
peptide sample).
In the second paper you can also find some solution for your previous
question:
> I have a question regarding deletions in sequence alignments. Most alignment
> programs use the formula ak+b to score an alignment of length k with b as the
> fixed gap penalty and a as the incremental gap penalty. What are the default
> values these parameters take in the common commercial alignment packages. Is
> there a consensus about what these values should be or do people simply use
> what works.
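To make the formula in the question concrete, here is a small Python sketch of the ak+b penalty, where k is the length of one gap. The values a=1 and b=3 in the example are arbitrary illustrations, not the defaults of any package, and the function names are hypothetical.

```python
def affine_gap_penalty(k, a, b):
    """Penalty a*k + b for one gap of k residues; zero when there is no gap."""
    if k < 0:
        raise ValueError("gap length must be non-negative")
    return a * k + b if k > 0 else 0

def gap_runs(row, gap="-"):
    """Lengths of the maximal runs of gap characters in one aligned row."""
    runs, k = [], 0
    for ch in row:
        if ch == gap:
            k += 1
        elif k:
            runs.append(k)
            k = 0
    if k:
        runs.append(k)
    return runs

def total_gap_penalty(row_a, row_b, a, b, gap="-"):
    """Sum of a*k + b over every gap in either row of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(affine_gap_penalty(k, a, b)
               for row in (row_a, row_b)
               for k in gap_runs(row, gap))

# One gap of length 2 and one of length 1: (1*2 + 3) + (1*1 + 3) = 9
print(total_gap_penalty("ACD--EFG", "ACDKLE-G", a=1, b=3))
```

With b > 0, one long gap costs less than several short gaps of the same total length, which is the purpose of separating the fixed and incremental terms.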
Together with Vladimir Tumanyan we carried out work that is essentially a
generalization of the Dayhoff approach to indels: collect statistics on
confident mutations in closely related sequences, formulate a law, and try
to use it to study distantly related sequences. The last paper also
addresses the first two tasks. Another related publication is:
E.J.Demchuk, V.G.Tumanyan (1987). Statistical regularities of deletions-
insertions in proteins. DOKLADY AKADEMII NAUK SSSR (Russ.) vol.296,
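The first of these tasks, collecting indel statistics from closely related sequences, might be sketched in Python as follows. This is only an illustration and not the procedure of the papers cited here. It assumes the input is a list of pairwise alignments given as two equal-length strings with "-" for gaps, and it counts every gap run, including terminal ones. The function names are hypothetical.

```python
from collections import Counter

def indel_length_counts(alignments, gap="-"):
    """Count gap runs by length over a list of (row_a, row_b) alignments."""
    counts = Counter()
    for row_a, row_b in alignments:
        if len(row_a) != len(row_b):
            raise ValueError("aligned rows must have equal length")
        for row in (row_a, row_b):
            run = 0
            for ch in row:
                if ch == gap:
                    run += 1
                elif run:
                    counts[run] += 1
                    run = 0
            if run:  # gap run that reaches the end of the row
                counts[run] += 1
    return counts

def indel_length_frequencies(counts):
    """Empirical frequency of each indel length (empty dict if no indels)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: n / total for k, n in sorted(counts.items())}

alignments = [
    ("ACD--EFG", "ACDKLEFG"),
    ("MK-LV", "MKALV"),
    ("AAAA", "AA--"),
]
counts = indel_length_counts(alignments)
print(indel_length_frequencies(counts))
```

A law for the dependence of indel frequency on length would then be fitted to such a table. That fitting step is not shown.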
I did some work to assess how useful the suggested indel weights are. It was
reported at a conference about two years ago and should appear in its
proceedings in the near future:
E.J.Demchuk, V.G.Tumanyan - Biological adequacy of protein sequence
alignments. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE "Modelling and
Computer Methods in Molecular Biology and Genetics" held at Novosibirsk
in August 1990 (in press, 1992).
I would be very grateful if you would let me know your results if you try
to follow the recommendations published in the papers mentioned above. In
any case, you should be careful in implementing them. They should work only
within a probabilistic scoring scheme (like Dayhoff's) that seeks the
maximum likelihood alignment, and they are valid only for proteins. That is,
you should take care over the scoring matrix you use. Neglecting this may
lead to totally wrong results that will be difficult to check (see the paper
of Barton and Sternberg in Prot. Engineering 1986 as an example). You should
also remember that the proposed approach is a statistical one. It will
produce definitely better results when applied systematically, but in any
particular case it may not be the
Eugene Demchuk Demchuk at EMBL-Heidelberg.de EMBL
6900 Heidelberg FRG
Fax: (+49-6221) 387-517; (+49-6221) 387-306
Phone: (+49-6221) 387-553; (+49-6221) 387-0 (via exchange)
Telex: 461613 (embl d)