In article <6anmv7$osi at net.bio.net> Iddo Friedberg <idoerg at cc.huji.ac.il> writes:
>This Monte-Carlo strategy of evaluating alignment scores is being used
>routinely in the GCG sequence alignment programs. Basically, the idea is
>as you stated it. Once you make, say, 100 randomizations, you get a
>normal distribution of scores (vs. the random) with a given mean, and
^^^^^^^^^^^^^^^^^^^
>standard deviation. In my group, we use the rule-of-thumb that if the
>non-random score is >6 S.D. above the random score, then there might be
>some biological significance. This seems like a bit of a harsh rule, as
>it is common wisdom that 2-3 standard deviations are enough for
>statistical significance. However, it was empirically found (Science,
>1991, D. Eisenberg, can't remember more than that, but should be enough
>for a Medline search), that 6 S.D. is a good rule....
And it's since been shown (papers by Karlin, Altschul, and others)
that the reason for this is that the score distribution for local
alignments is not a normal distribution. Z-scoring, which assumes
normality, is therefore unreliable: it overestimates how significant
a score is. The score distribution is instead closer to an extreme
value distribution, which has a longer upper tail than the Gaussian.
Bill Pearson's FASTA/SSEARCH software package is one example of a
package that lets you do Monte Carlo estimation of alignment
significance using the extreme value distribution.
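
To get a feel for the size of the effect, here is a rough Python sketch
(not what FASTA/SSEARCH actually does) that compares the tail
probability of a given Z-score under a normal distribution with that
under a Gumbel (type I extreme value) distribution matched to the same
mean and standard deviation. The method-of-moments matching and the
function names are my own assumptions for illustration; real packages
fit the extreme value parameters differently.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def z_score(real_score, shuffled_scores):
    """Z-score of the real alignment score against the shuffled scores."""
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)  # sample standard deviation
    return (real_score - mean) / sd


def normal_tail(z):
    """P(Z >= z) if the shuffled scores were normally distributed."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def gumbel_tail(z):
    """P(Z >= z) for a Gumbel distribution with mean 0 and SD 1.

    Assumption: the parameters are matched by the method of moments,
    using mean = mu + gamma * beta and variance = (pi * beta)**2 / 6.
    """
    beta = math.sqrt(6.0) / math.pi   # scale, in units of one SD
    mu = -EULER_GAMMA * beta          # location that gives mean 0
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x
    return -math.expm1(-math.exp(-(z - mu) / beta))


if __name__ == "__main__":
    for z in (2, 3, 6):
        print(f"Z = {z}: normal P = {normal_tail(z):.2e}, "
              f"Gumbel P = {gumbel_tail(z):.2e}")
```

Under these assumptions, the Gumbel tail probability at 6 S.D. comes out
several orders of magnitude larger than the normal one, which fits with
the 6 S.D. rule of thumb being less harsh than it looks.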
--
- Sean Eddy, Ph.D.
- Dept. of Genetics, Washington University School of Medicine
- 660 S. Euclid Box 8232, St. Louis MO 63110, USA
- mailto://eddy@genetics.wustl.edu http://genome.wustl.edu/eddy