# Frequency of BstE II cutting?

Chris Boyd chrisb at hgu.mrc.ac.uk
Thu Jun 27 11:13:29 EST 1996

Mikhail Alexeyev (malexeyev at biost1.thi.tmc.edu) wrote:
: In article <DtIG58.8wK.B.midge at bath.ac.uk>, bspwrb at bath.ac.uk (W R
: BENNETT) wrote:

: > If BstE II has a restriction site of GGTNACC, does it cut at the same
: > frequency as a six-cutter (i.e. an average 1 in 4096 disregarding sequence
: > distribution considerations), which is the "intuitive" answer, or does it
: > cut with reduced frequency (which is what I'd like!).  Promega's technical
: > department felt that it definitely cut at reduced frequency, but couldn't
: > really say why, or what the frequency was....
:
: Since 4096 = 4^6 is the number of DIFFERENT 6-nucleotide combinations that
: can be composed from 4 bases (A, T, G and C), the frequency of a 7-base
: cutter (with a non-degenerate recognition sequence) should be 1 in 4^7 =
: 4096 x 4 = 16384. However, since the BstE II recognition sequence is
: degenerate (redundant) at one position, there should be, on average, 4
: recognition sequences (ggtAacc, ggtTacc, ggtCacc and ggtGacc) for BstE II
: in every 16384 base pairs, which makes it 1 in 4096. To put it another way,
: only 6 positions matter for the recognition sequence, so the frequency
: should be the same as for a 6-base cutter. The same should be true for
: (hypothetical) enzymes with recognition sequences of GGT(N)xACC, where x
: could be any (reasonable) number. However, for an enzyme with the
: recognition sequence GGT(G/C)ACC, the frequency of cutting should be
: 1 in 4096 x 2 = 8192.

: Yet another way to put it is in terms of the probability of encountering
: a specific nucleotide at a specific position. For BstE II it should be:

:  G   G   T   N   A   C   C
: 1/4 1/4 1/4 4/4 1/4 1/4 1/4

: Probability is: (1/4)^6 x 4/4 = 4^-6 = 1 in 4096

Yes, this is a fair first approximation, and it is all you need for most
applications. In reality, however, the frequency of occurrence of any
given query sequence is markedly affected by the base composition and
sequence microstructure (CpG islands etc.) of the target DNA.  E.g.,
CTAG is far rarer than GATC in the E. coli genome.
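
Both the naive estimate quoted above and the base-composition caveat can be
sketched in a few lines of Python. This is only an illustration under a
simple assumption: every base is drawn independently, with A=T and G=C for
a chosen G+C fraction. The names site_probability and gc_freqs are
hypothetical helpers invented for this example, and the 40% G+C figure is
an arbitrary illustrative value, not a measurement of any genome.

```python
from fractions import Fraction

# IUPAC nucleotide codes mapped to the bases each one matches.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# The naive model: all four bases equally likely.
UNIFORM = {base: Fraction(1, 4) for base in "ACGT"}


def site_probability(site, base_freqs=UNIFORM):
    """Probability that a given position starts a match to `site`,
    assuming independent bases with the given frequencies."""
    p = Fraction(1)
    for code in site.upper():
        # A degenerate position contributes the summed frequency
        # of every base it accepts (N contributes 1).
        p *= sum(base_freqs[b] for b in IUPAC[code])
    return p


def gc_freqs(gc):
    """Base frequencies for a G+C fraction `gc` (assumes A=T and G=C)."""
    gc = Fraction(gc)
    return {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}


print(1 / site_probability("GGTNACC"))  # 4096, same as a 6-cutter
print(1 / site_probability("GGTSACC"))  # 8192, for GGT(G/C)ACC
# Hypothetical 40% G+C target: the site becomes rarer, about 1 in 6944.
print(float(1 / site_probability("GGTNACC", gc_freqs(Fraction(2, 5)))))
```

With equal base frequencies this reproduces the 1 in 4096 and 1 in 8192
figures from the quoted post; changing only the assumed G+C fraction is
already enough to move the expected spacing noticeably.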

For pedantically accurate theoretical results, you unfortunately have
to do a Markov chain analysis to explain why, and calculate to what
extent, sequences with repeated adjacent bases are commoner than the
above naive analysis would suggest.
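
To give a flavour of what such an analysis involves, here is a minimal
sketch of a first-order Markov chain estimate, in which each base depends
only on the one before it. This is a simplification chosen for
illustration and is not claimed to be the procedure used in the papers
listed below. The function names and the toy training sequence are made up
for this example; a real analysis would count dinucleotides in the actual
target genome.

```python
from collections import Counter


def independent_probability(word, training_seq):
    """P(word) assuming each base is drawn independently, using the
    mononucleotide frequencies observed in `training_seq`."""
    mono = Counter(training_seq)
    p = 1.0
    for base in word:
        p *= mono[base] / len(training_seq)
    return p


def markov1_probability(word, training_seq):
    """P(word) under a first-order Markov chain whose transition
    probabilities are counted from `training_seq`."""
    mono = Counter(training_seq)
    di = Counter(training_seq[i:i + 2] for i in range(len(training_seq) - 1))
    # Probability of the first base of the word.
    p = mono[word[0]] / len(training_seq)
    for prev, nxt in zip(word, word[1:]):
        # Transition probability P(nxt | prev), estimated as
        # N(prev, nxt) / N(prev followed by any base).
        starts = sum(di[prev + b] for b in "ACGT")
        p *= di[prev + nxt] / starts if starts else 0.0
    return p


# Toy sequence (invented): long runs of identical bases, yet exactly
# 25% of each base overall.
toy = "AAAACCCCGGGGTTTT" * 4

print(independent_probability("AA", toy))  # 0.0625
print(markov1_probability("AA", toy))      # 0.1875
```

In this toy sequence the base composition is perfectly even, so the
independent-base model predicts AA at 1 in 16, while the chain model, which
sees that A is usually followed by A, predicts it three times as often. For
a degenerate site such as GGTNACC, one would sum the estimate over the four
concrete sequences it stands for.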

See the following for more details:

author =       "G. J. Phillips and J. Arnold and R. Ivarie",
title =        "{Mono- through hexanucleotide composition of the {\it
Escherichia coli} genome: a Markov chain analysis}",
journal =      "Nucleic Acids Res.",
volume =       "15",
pages =        "2611--2626",
year =         "1987",

author =       "G. J. Phillips and J. Arnold and R. Ivarie",
title =        "{The effect of codon usage on the oligonucleotide
composition of the {\it E.~coli} genome and
identification of over- and underrepresented sequences
by Markov chain analysis}",
journal =      "Nucleic Acids Res.",
volume =       "15",
pages =        "2627--2638",
year =         "1987",

Best wishes,
--
Chris Boyd                       | from, | MRC Human Genetics Unit
chrisb at hgu.mrc.ac.uk             |  not  |  Western General Hospital
http://www.hgu.mrc.ac.uk/~chrisb |   for |   Edinburgh EH4 2XU, SCOTLAND