How to calculate ?

Steven Brenner brenner at mole.bio.cam.ac.uk
Sun Aug 11 12:55:41 EST 1996


The equation below is very interesting, but I think it won't work:
it gives the expected count for one fixed choice of parameters, and
the various parameters aren't all independent if you're only
considering the best (or near best) regions of alignment.

There is an extensive theory behind probabilities of two sequences
matching each other with a given level of similarity over a particular
length.  

An introduction to the theory, with many references, can be found on
the BLAST help page at:
  http://www.ncbi.nlm.nih.gov/BLAST/blast_help.html

An excerpt is:

            From Karlin and Altschul (1990), the principal equation
            relating the score of an HSP to its expected frequency of
            chance occurrence is:

                               E = K N exp(-Lambda S)

            where E is the expected frequency of chance occurrence of an
            HSP having score S (or one scoring higher); K and Lambda are
            Karlin-Altschul parameters; N is the product of the query
            and database sequence lengths, or the size of the search
            space; and exp is the exponentiation function.

            Lambda may be thought of as the expected increase in
            reliability of an alignment associated with a unit increase in
            alignment score. Reliability in this case is expressed in
            units of information, such as bits or nats, with one nat
            being equivalent to 1/log(2) (roughly 1.44) bits.
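
As a rough illustration, the excerpt's equation can be evaluated
directly. This is only a sketch in Python: the function names are made
up for this example, and the values of K and Lambda must come from the
scoring system actually used (none are given in the excerpt), so the
caller has to supply them.

```python
import math


def expected_hsps(score, query_len, db_len, k, lam):
    """Evaluate E = K * N * exp(-Lambda * S) from the excerpt.

    N is taken as the product of the query and database lengths.
    k and lam are the Karlin-Altschul parameters; they are NOT
    universal constants and are assumed to be supplied by the caller.
    """
    search_space = query_len * db_len
    return k * search_space * math.exp(-lam * score)


def nats_to_bits(nats):
    """Convert nats to bits: one nat is 1/log(2) bits (natural log)."""
    return nats / math.log(2)
```

Note that E falls off exponentially with the score S but grows only
linearly with the size of the search space N.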


leen at bio-3.bsd.uchicago.edu (Lee Newberg) writes:
>The average number of "matches" with exactly those parameters
>that arises randomly is not too difficult to figure out.
...
> Putting it all together gives

>E = (L1 + 1 - LR) * (L2 + 1 - LR) * (LR choose N) * (25%)^(LR-N) * (75%)^N
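
For reference, the quoted formula can be evaluated as written. A
minimal sketch in Python, assuming every position matches independently
with the same probability (25%, as in the formula, which suits uniformly
distributed nucleotides). The function name and the p_match parameter
are illustrative and not from the original post. It gives the expected
count for one fixed (LR, N) pair only, so the independence concern
raised at the top of this message still applies.

```python
import math


def expected_random_regions(l1, l2, lr, n_mismatch, p_match=0.25):
    """Expected number of ungapped regions of length lr with exactly
    n_mismatch mismatches between random sequences of lengths l1, l2.

    Implements the quoted formula:
      (L1 + 1 - LR) * (L2 + 1 - LR) * (LR choose N)
        * p_match^(LR - N) * (1 - p_match)^N
    p_match is an assumption (uniform, independent positions).
    """
    if lr > min(l1, l2) or not 0 <= n_mismatch <= lr:
        return 0.0  # no such region can exist
    # Number of start positions for the region in each sequence.
    placements = (l1 + 1 - lr) * (l2 + 1 - lr)
    # Ways to choose which positions in the region are mismatches.
    arrangements = math.comb(lr, n_mismatch)
    n_match = lr - n_mismatch
    return (placements * arrangements
            * p_match ** n_match * (1 - p_match) ** n_mismatch)
```

For amino acid sequences a different p_match would be needed, and a
single uniform value is itself a simplification.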

>In article <4u6oio$5tg at mserv1.dl.ac.uk>,
>Leonid A. Sadofiev <leosad at may.stud.pu.ru> wrote:
>> Dear all,
>> 
>> I can't work out how to calculate the following.
>> 
>> When I compare two sequences (amino acid or nucleotide)
>> with lengths L1 and L2, I get a common region of
>> length LR containing N mismatches.
>> The questions are:
>> What is the chance of obtaining such a region in unrelated sequences?
>> Can I use the binomial formula for this case?
>> 
>> Could anybody send me the formulas to calculate this chance,
>> or a reference for them?
>> 
>> Please reply to leosad at may.stud.pu.ru
>> 
>> Thanks in advance.
>>                         Leonid A. Sadofiev
>> 



-- 
Steven E. Brenner                    | S.E.Brenner at bioc.cam.ac.uk 
MRC Laboratory of Molecular Biology  | 
Hills Road                           | Office:   +44 1223 248011
Cambridge CB2 2QH, UK                | Fax:      +44 1223 213556



