> In article <6anmv7$osi at net.bio.net> Iddo Friedberg <idoerg at cc.huji.ac.il> writes:
> >This Monte-Carlo strategy of evaluating alignment scores is being used
> >routinely in the GCG sequence alignment programs. Basically, the idea is
> >as you stated it. Once you make, say, 100 randomizations, you get a
> >normal distribution of scores (vs. the random) with a given mean, and
> ^^^^^^^^^^^^^^^^^^^
> >standard deviation. In my group, we use the rule-of-thumb that if the
> >non-random score is >6 S.D. above the random score, then there might be
> >some biological significance. This seems like a bit of a harsh rule, as
> >it is common wisdom the 2-3 standard deviations are enough for
> >statistical significance.
...
Sean Eddy <eddy at wol.wustl.edu> writes:
>> And it's since been shown (papers by Karlin, Altschul, and others)
> that the reason for this is that the score distribution for local
> alignments is not a normal distribution. Z-scoring is unreliable,
> giving overestimates of how significant a score is. The score
> distribution is instead closer to an extreme value distribution, with
> a longer tail than the Gaussian.
It isn't so much that the Z-score is flawed, it's just that the 2 -
3 S.D. rule does not apply for the extreme-value distribution. To
estimate the probability of obtaining a score with a Z-value > z IN A
SINGLE COMPARISON you must use the following formula for the
extreme-value distribution:
p(Z > z ) = 1 - exp ( - exp (-1.282555 * z - 0.577216))
However, remember that when you do a database search (or a Monte-Carlo
shuffle), you never compare your sequence only to the highest scoring
sequence, you compare it to 50,000 (database search) or (100 - 1,000 -
shuffled sequences. So, the number of times you expect to obtain a
z-value > z after 50,000 comparisons is:
E(Z > z) = N * p(Z > z), where N=50,000.
The probability of obtaining this E() (expectation) value can be calculated
using the poisson formula:
p(Z > z: N comparisons) = 1 - exp (-E), where E is the
expectation calculated above. (Note that this formula would also be
required if the probability distribution were Gaussian. The 2 -3 S.D.
rule applies only when you have done only one test.)
For the SwissProt protein sequence database (~50,000 entries), the
z-value = 6 recommended in the original posting would have a
probability of:
p( Z > 6 ) = 2.554E-4
For N=50,000, E() = 2.554e-4 * 50,000 = 12.7, which means that a score
with a Z-value of 6 is expected about 13 times per database search and
thus the similarity score is not unexpectedly high (not statistically
significant).
In a Monte-Carlo shuffle done 100 times, the E() = 2.554E-4 * 100 =
0.0255, which is not very signifcant either. But, since one almost
never finds a sequence by doing only 100 comparisons, the Monte-Carlo
expectation value is artificially low.
I typically do 500 - 1,000 shuffles, and then report an expectation
value based on the search of a database of 10,000 - 50,000 sequences.
Bill Pearson