Some myths concerning statistical hypothesis testing

Richard Vickery Richard.Vickery at
Fri Nov 8 08:06:57 EST 2002

Your posts are couched in a somewhat hostile manner, which does not 
encourage others to join in or to ask for clarification.

> 1.) a p-value is a conditional probability of the form p(A/B) where A is the
> observation and B is the truth of the null hypothesis.
> 2.) you don't know if B is true or false.
> Conclusion: whatever a p-value is, it cannot be a quantitative assessment of
> the truth of B because the meaning of the p-value is dependent on B and you
> don't know what B is. Now attack the premises or the conclusion. I dare you.

So p is the probability of A given B.  I am not sure where truth comes 
into it.  But the quantitative assessment is that conditional probability.  
I know what B is (typically, that the two results are sampled from a 
common population that is normally distributed, homoscedastic, etc.); I 
just don't know if it is true.
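To make the premise concrete, here is a minimal simulation of what a p-value estimates: p(A|B), the chance of data at least as extreme as the observation, computed under the assumption that the null hypothesis B holds. All the numbers are invented for illustration.

```python
import random
import statistics

random.seed(1)

# Sketch: estimate p(A|B) directly. B = the null hypothesis (the true
# mean is mu0); A = a sample mean at least as extreme as the observed one.
# Every draw below is made under the assumption that B is true.
def p_value_by_simulation(observed_mean, n, mu0, sigma, trials=20000):
    extreme = 0
    for _ in range(trials):
        sample = [random.gauss(mu0, sigma) for _ in range(n)]
        if abs(statistics.mean(sample) - mu0) >= abs(observed_mean - mu0):
            extreme += 1
    return extreme / trials

# Invented example: observed mean 0.5 from n = 25; the null says mean 0, sd 1.
p = p_value_by_simulation(0.5, 25, 0.0, 1.0)
print(round(p, 3))  # the chance of data this extreme *if* B is true
```

Note that nothing in the computation tells you whether B is true; the null is simply assumed inside the loop, which is exactly the point of the premise.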

> Marc does not say what follows in his paper, but this misconception has
> produced a state of affairs in which a great deal of importance is
> attached to findings before it is clear that the finding is reliable. The
> result is that there is, all things being equal, a great deal of
> discrepant results in the various scientific literatures that rely on
> statistical significance testing. In contrast, for sciences in which the
> reliability is demonstrated in each subject (usually repeatedly), or
> "subject" if the preparation is not a whole animal, there is far less
> failure to replicate (this is because such data are published only when
> there have been numerous demonstrations of reliability within and across
> subjects). For an example of how this is done, you may examine my paper:
> The Effects of Acutely Administered Cocaine on Responding Maintained by a
> Progressive-ratio Schedule of Food Presentation, which is in press in
> Behavioural Pharmacology. Or, you may examine virtually any paper in the
> Journal of the Experimental Analysis of Behavior. Or you may obtain a copy
> of Sidman's Tactics of Scientific Research, or even Claude Bernard's
> classic book.
> Mat: doh! you are doing the very same as the people you chastise! by
> repeating the experiments you are increasing your n, such that if there is a
> true difference it should become apparent.
> GS: Nonsense. What I am doing, and what others like me do is directly
> demonstrating the reliability. That's why it is not unheard of to publish
> data collected and analyzed "individual-subject style" with 3 subjects. And
> such data are, as I explained, generally proven to be reliable through
> direct and systematic replication. What "thinkers" like you do is increase
> the N because doing so will almost always result in differences even if the
> "effect" is virtually nonexistent (see below).

Glenn, it is very dependent upon what you work on.  I record single 
neurons.  They just don't hang around long enough to do a lot of repeated 
measures.  I also can't see why you prefer 3 people tested 5 times to 15 
people tested once, unless you need trained subjects, or you want to look 
at intra- and inter-subject variability, which might be important for some 
things.  For many clinical trials the patient gets better with treatment, 
and it is not ethical to make them sick again ;-)

> averaged together. And if, say, only two subjects showed the effect in
> question, I wouldn't publish the data, but I would strongly suspect that
> there was something worth pursuing, and I might try to figure out why I got
> the effect in only two of the animals. 

Surely this depends on what the effect is.  Aren't there a small 
proportion of people who are HIV positive but never develop AIDS?  Even if 
they were 2 out of 100, they would be worth investigating.  This is 
really to do with being a good scientist, not a stats abuser.  I don't 
think anyone is disagreeing with this.  This is a very different 
situation from a controlled randomized trial where you are not exploring, 
but simply testing a simple hypothesis.

> GS: No, it doesn't. It tells you that IF THE NULL HYPOTHESIS IS TRUE (which
> you don't know) there is a 5% or 1% chance of obtaining the data again.
> Since you don't know if the null hypothesis is true or not, you have no
> quantitative assessment of the likelihood of obtaining the observation,

But you're not interested in the likelihood of getting the observation - 
you already have it.  The issue is that if the likelihood of getting the 
data was small given that the null hypothesis is true, we choose to take 
a punt and say the null hypothesis is likely not true.
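The gap between p(A|B) and p(B|A) that the thread keeps circling can be shown in a few lines of Bayes' rule. The two likelihoods and the prior below are invented assumptions, not estimates from any study; the only point is that the posterior probability of the null is a different number from the p-value.

```python
# Bayes' rule makes the asymmetry explicit: the posterior p(B|A)
# depends on a prior for B, which the p-value p(A|B) cannot supply.
# All three numbers are illustrative assumptions.
p_data_given_null = 0.05   # p(A|B): data this extreme if the null holds
p_data_given_alt  = 0.60   # p(A|not B): data this extreme if an effect exists
prior_null        = 0.50   # assumed prior belief in the null

posterior_null = (p_data_given_null * prior_null) / (
    p_data_given_null * prior_null
    + p_data_given_alt * (1.0 - prior_null)
)
print(round(posterior_null, 3))  # 0.077, not 0.05
```

Change the assumed prior and the posterior moves, while the p-value stays fixed at 0.05, which is why "taking a punt" on rejecting the null is a decision rule rather than a probability statement about B.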

> GS: Think about this: if you have a drug that produces large effects in 40%
> of the sample, and no effect in the other 60%, one could obtain statistical
> significance if one increased the N enough. So now we have an effect that
> works in only 40% of the population and it is deemed important and reliable?
> If you are dying, you might want to try it, but only an insipid idiot would
> call it reliable. Yet this is, apparently, your version of "modern science."
> But, of course, in most experiments, not even the researcher may know how
> many of his subjects actually showed an effect. All he or she may know
> (because that is all they are paying attention to) is that p<.01. And
> certainly the reader usually has no clue as to how many of the subjects
> actually "had" the "effect." In medical research, fortunately, there is some
> pressure to pay close attention to the individual effects (BTW, Mat, if it
> is possible to judge an effect in an individual, what do you need statistics
> for?) . However, I argue, and occasionally some enlightened MD argues, that
> significance testing is dangerous. Sometimes you have nothing else but quite
> often you do.

Aren't we all on the same page?  You plot the data.  You look for sub-
groups and weird effects.  You can test for some of these properties.  If 
everything looks like a homogeneous group then you can do some 
inferential stats on them.  In your example, the data would have two 
peaks (at 0 and +x% effect) and would not be normally distributed.  Anyone 
testing this without caution is an idiot, but it does not make the 
statistical tests wrong.
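Glenn's 40%/60% scenario is easy to simulate, and the simulation shows why plotting comes before testing. The effect size (+10 units) and sample size are arbitrary choices for illustration.

```python
import random

random.seed(2)

# Simulated version of the scenario: 40% of subjects show a large
# effect (+10 units, an arbitrary number), 60% show none at all.
effects = [10.0 if random.random() < 0.4 else 0.0 for _ in range(1000)]

mean_effect = sum(effects) / len(effects)
responders = sum(1 for e in effects if e > 0) / len(effects)
print(round(mean_effect, 1), round(responders, 2))
# The group mean (around 4) describes no individual subject: the
# distribution has peaks at 0 and 10, so a histogram reveals the
# sub-groups before any significance test is run.
```

The mean is exactly 10 times the responder fraction here, which is the sense in which an averaged "effect" can be an artifact of pooling two distinct populations.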

> GS: Usually the null hypothesis is, in the simplest case, that there is no
> difference between the control group (or control condition as in the paired
> t-test, which is the simplest form of repeated-measures ANOVA; hehehe) and
> experimental group. So, yes, if you are doing ANYTHING it is likely to have
> SOME effect, and if you throw enough subjects at it, you will eventually
> reach a point where you "obtain statistical significance." This is, in fact,
> usually what happens in the sort of "science" you are talking about. BTW, in
> physics and many other sciences, what functions as the null hypothesis is,
> in fact, the scientist's own prediction! That is, the scientist does
> everything in his or her power to reject their own prediction, and when this
> does not occur they begin to assert the importance of their hypothesis. In
> contrast, "scientists" like you do everything in their power to reject the
> strawman notion that there is no effect which, as I have pointed out, is
> almost certain to be false.

Come on Glenn, I don't think that too many papers are pointing out a 5% 
difference even if it is significant at p<0.001.  Maybe you've had a bad 
experience lately you want to share?  Clinical significance involves the 
idea that the effect is worth risking a change in therapy, and so must be 
a substantial improvement (not 5%) as well as statistically significant.
Yours in enquiry

Richard Vickery

More information about the Neur-sci mailing list