Adlai Burman wrote:
>I've been trying to figure out something that should be simple but I seem
>to be stuck. Perhaps someone can help. I need to figure out what the
>probability of NOT finding a fragment which is r bp long in a sequence
>which is n bp long. I can do this for individual cases but I need a
>purely symbolic solution which would cover all cases. This is all
>assuming a completely random sequence.
>If anyone happens to know off the top of their heads what the solution is
>it would be grotesquely appreciated.
The problem can be turned around and solved as follows:
Let R be the string that is r bp long, and
let N be the string that is n bp long.
Then P[R in N] = 1 - P[R not in N]
= 1 - P[R not in positions 1..r of N]*
P[R not in positions 2..r+1 of N]*
...
P[R not in positions n-r+1..n of N]
= 1 - (1 - P[R in positions 1..r of N])*
(1 - P[R in positions 2..r+1 of N])*
...
(1 - P[R in positions n-r+1..n of N])
= 1 - (1 - ((1/4)^r)*
(1 - ((1/4)^r)*
...
(1 - ((1/4)^r)
= 1 - ((1 - ((1/4)^r))^(n-r+1))
where * is multiplication and ^ is exponentiation (power).
If the alphabet is of size other than 4 (as in proteins),
then substitute the size of the alphabet for 4 above.
I.e., if alphabet size is s, then
P[R in N] = 1 - ((1 - ((1/s)^r))^(n-r+1))
-Paul