Dear all,Evaluating low identity scores

William R. Pearson wrp at alpha0.bioch.virginia.edu
Wed Jan 28 13:02:55 EST 1998


> In article <6anmv7$osi at net.bio.net> Iddo Friedberg <idoerg at cc.huji.ac.il> writes:
>   >This Monte-Carlo strategy of evaluating alignment scores is being used
>   >routinely in the GCG sequence alignment programs. Basically, the idea is
>   >as you stated it. Once you make, say, 100 randomizations, you get a
>   >normal distribution of scores (vs. the random) with a given mean, and
>    ^^^^^^^^^^^^^^^^^^^
>   >standard deviation. In my group, we use the rule-of-thumb that if the
>   >non-random score is >6 S.D. above the random score, then there might be
>   >some biological significance. This seems like a bit of a harsh rule,  as
>   >it is common wisdom the 2-3 standard deviations are enough for
>   >statistical significance.
...
Sean Eddy <eddy at wol.wustl.edu> writes:
> 
> And it's since been shown (papers by Karlin, Altschul, and others)
> that the reason for this is that the score distribution for local
> alignments is not a normal distribution. Z-scoring is unreliable,
> giving overestimates of how significant a score is. The score
> distribution is instead closer to an extreme value distribution, with
> a longer tail than the Gaussian.

It isn't so much that the Z-score is flawed, it's just that the 2 -
3 S.D.  rule does not apply for the extreme-value distribution.  To
estimate the probability of obtaining a score with a Z-value > z IN A
SINGLE COMPARISON you must use the following formula for the
extreme-value distribution:

	p(Z > z ) = 1 - exp ( - exp (-1.282555 * z - 0.577216))

However, remember that when you do a database search (or a Monte-Carlo
shuffle), you never compare your sequence only to the highest scoring
sequence, you compare it to 50,000 (database search) or (100 - 1,000 -
shuffled sequences.  So, the number of times you expect to obtain a
z-value > z after 50,000 comparisons is:

	E(Z > z) = N * p(Z > z), where N=50,000.

The probability of obtaining this E() (expectation) value can be calculated
using the poisson formula:
	p(Z > z: N comparisons) = 1 - exp (-E), where E is the
expectation calculated above.  (Note that this formula would also be
required if the probability distribution were Gaussian. The 2 -3 S.D.
rule applies only when you have done only one test.)

For the SwissProt protein sequence database (~50,000 entries), the
z-value = 6 recommended in the original posting would have a
probability of:

	p( Z > 6 ) = 2.554E-4

For N=50,000, E() = 2.554e-4 * 50,000 = 12.7, which means that a score
with a Z-value of 6 is expected about 13 times per database search and
thus the similarity score is not unexpectedly high (not statistically
significant).

In a Monte-Carlo shuffle done 100 times, the E() = 2.554E-4 * 100 =
0.0255, which is not very signifcant either.  But, since one almost
never finds a sequence by doing only 100 comparisons, the Monte-Carlo
expectation value is artificially low.

I typically do 500 - 1,000 shuffles, and then report an expectation
value based on the search of a database of 10,000 - 50,000 sequences.

Bill Pearson




More information about the Mol-evol mailing list