Dear all, Evaluating low identity scores

Sean Eddy eddy at wol.wustl.edu
Wed Jan 28 12:09:45 EST 1998


In article <6anmv7$osi at net.bio.net> Iddo Friedberg <idoerg at cc.huji.ac.il> writes:
  >This Monte-Carlo strategy of evaluating alignment scores is being used
  >routinely in the GCG sequence alignment programs. Basically, the idea is
  >as you stated it. Once you make, say, 100 randomizations, you get a
  >normal distribution of scores (vs. the random) with a given mean, and
   ^^^^^^^^^^^^^^^^^^^
  >standard deviation. In my group, we use the rule-of-thumb that if the
  >non-random score is >6 S.D. above the random score, then there might be
  >some biological significance. This seems like a bit of a harsh rule,  as
  >it is common wisdom the 2-3 standard deviations are enough for
  >statistical significance. However, it was empirically found (Science,
  >1991, D. Eisenberg, can't remember more than that, but should be enough
  >for a Medline search), that 6 S.D is a good rule....

And it's since been shown (papers by Karlin, Altschul, and others)
that the reason for this is that the score distribution for local
alignments is not a normal distribution. Z-scoring is unreliable,
giving overestimates of how significant a score is. The score
distribution is instead closer to an extreme value distribution, with
a longer tail than the Gaussian. Bill Pearson's FASTA/SSEARCH software
package is an example of a package that lets you do Monte Carlo
estimation of alignment significance using the extreme value
distribution.
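
To make the difference concrete, here is a minimal Python sketch of the
Monte Carlo idea. It is an illustration under stated assumptions, not the
procedure FASTA/SSEARCH actually uses: the function names, the toy
match/mismatch/gap scores, and the method-of-moments Gumbel fit are all
choices made for this example. It scores a query against shuffled copies
of a target with a plain Smith-Waterman recursion, then converts an
observed score to a P-value two ways: assuming a Gaussian null (what a
Z-score implicitly does) and assuming an extreme value (Gumbel) null.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, used in the Gumbel fit


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (toy scores, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def shuffled_scores(a, b, n, rng):
    """Scores of a against n shuffled copies of b (the Monte Carlo null)."""
    letters = list(b)
    scores = []
    for _ in range(n):
        rng.shuffle(letters)
        scores.append(local_score(a, "".join(letters)))
    return scores


def normal_pvalue(x, mean, sd):
    """P(S >= x) if the null scores were Gaussian (the Z-score assumption)."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2)))


def gumbel_pvalue(x, mean, sd):
    """P(S >= x) for a Gumbel distribution fitted by the method of moments.

    This simple fit is an assumption of the sketch; real packages use
    more careful estimation.
    """
    beta = sd * math.sqrt(6) / math.pi
    mu = mean - EULER_GAMMA * beta
    return -math.expm1(-math.exp(-(x - mu) / beta))


if __name__ == "__main__":
    rng = random.Random(1998)  # fixed seed so the run is repeatable
    query = "".join(rng.choice("ACGT") for _ in range(60))
    target = "".join(rng.choice("ACGT") for _ in range(60))

    null = shuffled_scores(query, target, 200, rng)
    mean, sd = statistics.mean(null), statistics.stdev(null)

    # Evaluate a hypothetical score sitting 6 S.D. above the shuffled mean.
    x = mean + 6 * sd
    print("normal P-value:", normal_pvalue(x, mean, sd))
    print("gumbel P-value:", gumbel_pvalue(x, mean, sd))
```

With the same mean and standard deviation, a score 6 S.D. above the mean
has a Gaussian tail probability on the order of 1e-9 but a Gumbel tail
probability on the order of 1e-4, which is the sense in which Z-scores
overstate significance. A real protein comparison would use a substitution
matrix and affine gap penalties rather than these toy scores.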

-- 

- Sean Eddy, Ph.D. 
- Dept. of Genetics, Washington University School of Medicine
- 660 S. Euclid Box 8232, St. Louis MO 63110, USA 
- mailto://eddy@genetics.wustl.edu http://genome.wustl.edu/eddy
