
# Palindromic & repeated DNA

Zharkikh Andrey GSBS1022%UTSPH.THENET at LIB.TMC.EDU
Wed Apr 1 10:55:35 EST 1992

>I found a region of just over 300 base pairs which is
>palindromic, with 40% identity. The palindrome is perfect, you can find an
>axis of symmetry from which you can produce a mirror image for all of the
>matched positions (which must be true in order for the thing to be a palindrome)
>But as I said, the match is not even close to 100%.
>I decided that this must be some kind of artifact, or else a coincidence -
>something that looks odd, but based on the laws of probability is not actually
>that unusual. So for a comparison, I took what I know is a sequence which
>codes for a gene (a malate dehydrogenase from maize), and I did the same thing
>to it. I did this just to see whether something like this would happen by
>chance on another piece of DNA. To my surprise I found a similar region of
>about 100 base pairs, and this time the level of identity was 65%. I thought
>perhaps this was some kind of transcriptional or translational control region,
>but it turned out to be in the signal peptide of the protein!
>--
>Mary C. Metzler

You can use an approximate formula to estimate the probability
that an identity as high as the one you observed, or higher,
occurs by chance:

X = (M - np)/sqrt[np(1-p)]

Prob = F(X) * (N-2n)

where   p = 0.25 (if all nucleotides are equiprobable)
        n    - the length of the region of complementarity
        M    - the number of complementary bases
        N    - total length of the sequence containing the complementarity
        F(X) - integral of the standard Normal density from X to infinity
               (taken from statistical tables).
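For anyone without statistical tables at hand, here is a minimal Python sketch of the two formulas above. The function names are made up for illustration, and the tail F(X) is computed from the complementary error function instead of being looked up. N = 1000 in the example call is only a guess at the total sequence length.

```python
import math

def upper_tail(x):
    """F(X): integral of the standard normal density from x to infinity."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def chance_of_match(n, M, N, p=0.25):
    """Return (X, F(X), Prob) for M complementary bases in a region of length n.

    p is the per-position probability of a complementary pair,
    N is the total length of the sequence that was searched.
    """
    x = (M - n * p) / math.sqrt(n * p * (1.0 - p))
    f = upper_tail(x)
    return x, f, f * (N - 2 * n)

# 300 bp region with 40% identity; N = 1000 is an assumed value.
x, f, prob = chance_of_match(n=300, M=120, N=1000)
print(f"X = {x:.2f}, F(X) = {f:.3g}, Prob = {prob:.2g}")
```

The same call with other values of n, M and N covers any other region. Note that the normal approximation is rough this far out in the tail, so the result is best read as an order of magnitude.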

So, in your case n=300, M=300*0.4=120. N is possibly about 1000 (?).
We get X=6.0, and from stat. table F(X)=0.000000000987, and
Prob=0.00000039.

For the maize gene, n=100, M=100*0.65=65, X=9.24, F(X) is of order 1e-20, and
Prob (again with N of about 1000) is of order 1e-17.

If the nucleotides are not equally frequent, p should be
recalculated as

p = 2*pA*pT + 2*pC*pG,

where pA, pT, pG, and pC are frequencies of A, T, G, and C.
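As a rough illustration, p can be computed directly from the base composition of the sequence. The helper name and the two toy sequences below are invented for the example. Characters other than A, C, G and T are simply ignored, which is an assumption about how ambiguity codes should be handled.

```python
from collections import Counter

def complement_probability(seq):
    """p = 2*pA*pT + 2*pC*pG, estimated from the base composition of seq."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    pA, pC, pG, pT = (counts[b] / total for b in "ACGT")
    return 2 * pA * pT + 2 * pC * pG

# Toy sequences, not real data.
p_equal = complement_probability("ACGT" * 10)    # all four bases equally frequent
p_at_rich = complement_probability("AATTAATTGC")  # 80% A+T
print(p_equal, p_at_rich)
```

With equal base frequencies this gives back p = 0.25. A skewed composition raises p, which in turn lowers X for the same n and M.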

Moreover, if the encoded protein has regular structure (helices, sheets, etc.),
this can increase the probability that extensive matches occur
by chance.

PS: Maybe my logic is incorrect! :-)

Andrey Zharkikh