IUBio Biosequences .. Software .. Molbio soft .. Network News .. FTP

Help Combinatorics

Paul F. Stelling stelling at suffix.cs.ucdavis.edu
Tue May 7 21:58:25 EST 1996

Adlai Burman wrote:

>I've been trying to figure out something that should be simple but I seem 
>to be stuck. Perhaps someone can help. I need to figure out what the 
>probability of NOT finding a fragment which is r bp long in a sequence 
>which is n bp long. I can do this for individual cases but I need a 
>purely symbolic solution which would cover all cases. This is all 
>assuming a completely random sequence. 
>If anyone happens to know off the top of their heads what the solution is 
>it would be grotesquely appreciated.

The problem can be turned around and solved as follows:
Let R be the string that is r bp long, and
let N be the string that is n bp long.

Then P[R in N] = 1 - P[R not in N]
               = 1 - P[R not in positions 1..r of N]*
                     P[R not in positions 2..r+1 of N]*
                     P[R not in positions n-r+1..n of N]
               = 1 - (1 - P[R in positions 1..r of N])*
                     (1 - P[R in positions 2..r+1 of N])*
                     (1 - P[R in positions n-r+1..n of N])
               = 1 - (1 - ((1/4)^r)*
                     (1 - ((1/4)^r)*
                     (1 - ((1/4)^r)
               = 1 - ((1 - ((1/4)^r))^(n-r+1))

where * is multiplication and ^ is exponentiation (power).

If the alphabet is of size other than 4 (as in proteins),
then substitute the size of the alphabet for 4 above.
I.e., if alphabet size is s, then

     P[R in N] = 1 - ((1 - ((1/s)^r))^(n-r+1))


More information about the Mol-evol mailing list

Send comments to us at biosci-help [At] net.bio.net