Palindromic & repeated DNA

Zharkikh Andrey GSBS1022%UTSPH.THENET at LIB.TMC.EDU
Wed Apr 1 10:55:35 EST 1992

>I found a region of just over 300 base pairs which is
>palindromic, with 40% identity. The palindrome is perfect, you can find an
>axis of symmetry from which you can produce a mirror image for all of the
>matched positions (which must be true in order for the thing to be a palindrome)
>But as I said, the match is not even close to 100%.
>I decided that this must be some kind of artifact, or else a coincidence -
>something that looks odd, but based on the laws of probability is not actually
>that unusual. So for a comparison, I took what I know is a sequence which
>codes for a gene (a malate dehydrogenase from maize), and I did the same thing
>to it. I did this just to see whether something like this would happen by
>chance on another piece of DNA. To my surprise I found a similar region of
>about 100 base pairs, and this time the level of identity was 65%. I thought
>perhaps this was some kind of transcriptional or translational control region,
>but it turned out to be in the signal peptide of the protein!
>Mary C. Metzler

You can use an approximate formula to estimate the probability
of random occurrence of the identity you observed, or better:

	X = (M - np)/sqrt[np(1-p)]

	Prob = F(X) * (N-2n)

where   p=0.25 (if all nucleotides are equiprobable)
	n - the length of the region of complementarity
	M - the number of complementary bases
	N - total length of the sequence containing
		the complementarity
	F(X) - integral of the Normal distribution function
		from X to infinity (taken from statistical tables).
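
As a rough sketch, the two formulas above can be evaluated without
statistical tables. The function names below are illustrative only (they
are not from any package), and F(X) is computed from the complementary
error function as the upper tail of the standard Normal distribution.

```python
import math


def z_score(n, m, p=0.25):
    """X = (M - n*p) / sqrt(n*p*(1-p)): Normal approximation to the binomial."""
    return (m - n * p) / math.sqrt(n * p * (1.0 - p))


def upper_tail(x):
    """F(X): integral of the standard Normal density from x to infinity."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def chance_probability(n, m, total_len, p=0.25):
    """Prob = F(X) * (N - 2n), with total_len playing the role of N."""
    return upper_tail(z_score(n, m, p)) * (total_len - 2 * n)
```

Note that Prob is F(X) multiplied by a count of positions, so it is best
read as an upper bound on the chance of seeing such a region somewhere in
the sequence, not as an exact probability.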

So, in your case n=300, M=300*0.4=120. N is possibly about 1000 (?).
We get X=6.0, and from stat. table F(X)=0.000000000987, and

For the maize gene, n=100, M=100*0.65=65, X=9.24, F(X)=1.3e-20 (approx.), and

If the nucleotides are not present in equal amounts, p should be
recalculated as

		p = 2*(pA)*(pT) + 2*(pC)*(pG),

where pA, pT, pG, and pC are frequencies of A, T, G, and C.
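
A minimal sketch of this recalculation, assuming the base frequencies are
simply counted from the sequence itself (the function name is made up for
illustration, and characters other than A, C, G, T are ignored):

```python
from collections import Counter


def complementarity_probability(seq):
    """p = 2*pA*pT + 2*pC*pG, using base frequencies counted from seq."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    freq = {base: counts[base] / total for base in "ACGT"}
    return 2 * freq["A"] * freq["T"] + 2 * freq["C"] * freq["G"]
```

With equal frequencies this reduces to p = 0.25, as in the default case
above; an A+T-rich or G+C-rich sequence gives a larger p.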

Moreover, if proteins have regular structure (helices, sheets etc.)
it can increase the probability of random occurrence of extensive

PS: Maybe my logic is incorrect! :-)

Andrey Zharkikh

More information about the Mol-evol mailing list
