Some myths concerning statistical hypothesis testing

Glen M. Sizemore gmsizemore2 at
Fri Nov 8 10:28:55 EST 2002

RV: Glenn,
your posts are couched in a somewhat hostile manner which does not
encourage others to join in or to ask for clarification.

GS: Not a word about the way Mat treated me? Why? Perhaps I escalated a bit,
but that is how one responds to an ad hominem. You'll notice that I am civil
to you, although we disagree. I have a long history with Mat. His favorite
tactics are saying "rubbish" and not responding to what one wrote, as well
as simply calling me stupid or uninformed.

> 1.) a p-value is a conditional probability of the form p(A|B) where A is
> the observation and B is the truth of the null hypothesis.
> 2.) you don't know if B is true or false.
> Conclusion: whatever a p-value is, it cannot be a quantitative assessment
> of the truth of B because the meaning of the p-value is dependent on B and
> you don't know what B is. Now attack the premises or the conclusion. I dare
> you.

RV: So p is the probability of B given A.

GS: No, it is the probability of A given B. That's the whole point.
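
To make the direction of the conditional concrete, both quantities can be computed under explicit assumptions. The sketch below is an illustration only and is not taken from the thread: the prior probability that the null is true and the power of the test are invented inputs, the function name is hypothetical, and "A" is coarsened to "the test came out significant at level alpha". It applies Bayes' rule to show that P(A given B) = 0.05 says little about P(B given A) unless those other inputs are known.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """Return P(null true | significant result) via Bayes' rule.

    prior_null: assumed prior probability that the null hypothesis is true
    alpha:      P(significant | null true), the significance level
    power:      assumed P(significant | null false)
    """
    p_sig_and_null = prior_null * alpha
    p_sig_and_alt = (1.0 - prior_null) * power
    return p_sig_and_null / (p_sig_and_null + p_sig_and_alt)


alpha = 0.05
for prior_null in (0.5, 0.9):
    posterior = prob_null_given_significant(prior_null, alpha, power=0.8)
    print(f"prior P(null)={prior_null:.1f}  P(sig | null)={alpha:.2f}  "
          f"P(null | sig)={posterior:.3f}")
```

With these made-up inputs the same alpha of 0.05 corresponds to quite different values of P(null given significant), which is the sense in which the p-value alone is not an assessment of B.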

RV: I am not sure where truth comes into it. But the quantitative
assessment is the conditional probability. I know what B is (typically,
that the two results are sampled from a common population that is normally
distributed, homoscedastic, etc.), I just don't know if it is true.

GS: The meaning of the p-value is quantitatively accurate only if the
conditional portion is "there."

> > Marc does not say what follows in his paper, but this misconception
> > produced a state of affairs in which a great deal of importance is
> > attached to findings before it is clear that the finding is reliable. The
> > result is that there is, all things being equal, a great deal of
> > discrepant results in the various scientific literatures that rely on
> > statistical significance testing. In contrast, for sciences in which the
> > reliability is demonstrated in each subject (usually repeatedly), or
> > "subject" if the preparation is not a whole animal, there is far less
> > failure to replicate (this is because such data are published only when
> > there have been numerous demonstrations of reliability within and across
> > subjects). For an example of how this is done, you may examine my paper:
> > Effects of Acutely Administered Cocaine on Responding Maintained by a
> > Progressive-ratio Schedule of Food Presentation, which is in press in
> > Behavioural Pharmacology. Or, you may examine virtually any paper in the
> > Journal of the Experimental Analysis of Behavior. Or you may obtain a copy
> > of Sidman's Tactics of Scientific Research, or even Claude Bernard's
> > book.
> Mat: doh! you are doing the very same as the people you chastise! by
> repeating the experiments you are increasing your n, such that if there is
> a true difference it should become apparent.
> GS: Nonsense. What I am doing, and what others like me do, is directly
> demonstrating the reliability. That's why it is not unheard of to publish
> data collected and analyzed "individual-subject style" with 3 subjects.
> Such data are, as I explained, generally proven to be reliable through
> direct and systematic replication. What "thinkers" like you do is increase
> the N because doing so will almost always result in differences even if
> the "effect" is virtually nonexistent (see below).

RV: Glenn, it is very dependent upon what you work on. I record single
neurons. They just don't hang around long enough to do a lot of repeated

GS: You mean in vitro? Are you saying you can't get a baseline, introduce a
variable, and then withdraw the variable and then introduce it again? I
don't understand.

RV: I also can't see why you prefer 3 people tested 5 times to 15
people tested once, unless you need trained subjects, or you want to look
at intra and inter-subject variability which might be important for some
things. For many clinical trials the patient gets better with treatment,
and it is not ethical to make them sick again ;-)

GS: Because I am interested in directly demonstrating the reliability of the
effect within and across subjects. In the paper I just got published, 4 of
the 5 rats showed increases in breakpoint (the rats must "pay" leverpresses
for food and the number that they must pay increases after each food
delivery - after some point they stop responding and the "price" they "paid"
is the breakpoint) at some dose of cocaine every time it was administered.
So if each dose was given three times, and 4 of the 5 rats showed a
consistent increase at some dose, the fact that cocaine can increase
breakpoint was replicated many times before the end of the experiment. I can
guarantee you that if you arrange such schedules with rats (and no doubt
many other species) you will be very likely to find some dose that increases
breakpoint in almost every subject. And indeed, I have replicated the result
in an off-hand probe, and have also observed similar effects with tropane
analogs. If I gave 15 rats one dose, I would not know if the increases were
reliable at all. Indeed, in the rat that did not show the effect reliably, I
did produce increases with the first administration of, I think, 30 mg/kg,
but could not produce any increases after that. Incidentally, that rat was
similar to the others on a couple of different measures.

I wasn't talking about ethics, or even medical research in general (until
Mat brought it up) but since you bring it up, why is it any less ethical to
temporarily stop treatment than to give sick people placebo? That way
everybody gets the drug, gets better, gets temporarily sicker when you
withdraw the drug and start injecting vehicle, and then gets better again
when you determine that they are getting worse and you reintroduce the drug.

> averaged together. And if, say, only two subjects showed the effect in
> question, I wouldn't publish the data, but I would strongly suspect that
> there was something worth pursuing, and I might try to figure out why I
> the effect in only two of the animals.

RV: Surely this depends on what the effect is. Isn't there a small
proportion of people who are HIV positive but never develop AIDS? Even if
they were 2 out of 100, they would be worth investigating. This is
really to do with being a good scientist, not a stats abuser. I don't
think anyone is disagreeing with this. This is a very different
situation from a controlled randomized trial where you are not exploring,
but simply testing a simple hypothesis.

GS: It is common for researchers to simply add subjects until significance
is reached. Abuse of stats is exactly what I am complaining about.
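
The practice of adding subjects until significance is reached can be illustrated with a small simulation. This is a sketch under stated assumptions, not anything run in the thread: the null hypothesis is true by construction (every observation is drawn from a normal distribution with mean 0 and known standard deviation 1), a two-sided z-test is used for simplicity, and the function names, sample sizes, and seed are all hypothetical choices.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def reaches_significance(rng, n_start, n_max, alpha, peek):
    """Run one experiment in which the null hypothesis is true.

    With peek=True the data are tested after every added subject from
    n_start onward and the experiment stops at the first p < alpha.
    With peek=False the data are tested once, at n_max.
    """
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)  # null is true: mean 0, sd 1
        if n >= n_start and (peek or n == n_max):
            z = total / math.sqrt(n)  # z-statistic for the mean, known sd
            if two_sided_p(z) < alpha:
                return True
    return False


def false_positive_rate(peek, n_sims=2000, n_start=10, n_max=100,
                        alpha=0.05, seed=1):
    """Fraction of null experiments that end up 'significant'."""
    rng = random.Random(seed)
    hits = sum(reaches_significance(rng, n_start, n_max, alpha, peek)
               for _ in range(n_sims))
    return hits / n_sims


print("fixed N, one test:      ", false_positive_rate(peek=False))
print("add subjects until p<.05:", false_positive_rate(peek=True))
```

In this setup the single fixed-N test should come out "significant" at roughly the nominal 5% rate, while testing after every added subject should do so noticeably more often, even though there is no effect at all.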

> GS: No, it doesn't. It tells you that IF THE NULL HYPOTHESIS IS TRUE (which
> you don't know) there is a 5% or 1% chance of obtaining the data again.
> Since you don't know if the null hypothesis is true or not, you have no
> quantitative assessment of the likelihood of obtaining the observation,

RV: But you're not interested in the likelihood of getting the observation -
you already have it.

GS: What one should be interested in is the reliability and generality of
the finding. Repeatedly testing a few subjects can directly demonstrate the
reliability as do direct replications in other laboratories, and systematic
replication directly demonstrates the generality. Through this means the
facts uncovered by the experimental analysis of behavior are among the most
highly replicable in psychology. There are many, many effects that are very
large and obtainable in virtually every subject. Some are pretty reliable
but known to fail in the occasional subject. This is true of the
cocaine-induced increases in breakpoint, as well as a couple of other
cocaine effects, as well as, I'm sure, a few others. One might be able to
track down why they are different (as I explained with respect to the monkey
experiment - in this case what was at issue was not why a few subjects
didn't show the effect but, rather, why most monkeys did not show an effect
that is quite reliable in rats and pigeons, but the principles are the same).

In any event, what I am complaining about is the rather widespread notion
that a small p-value is a quantitative estimate of the reliability of the

RV: The issue is that if the likelihood of getting the
data was small given that the null hypothesis is true, we choose to take
a punt and say the null hypothesis is likely not true.

GS: I'm well aware of that.

> GS: Think about this: if you have a drug that produces large effects in
> 40% of the sample, and no effect in the other 60%, one could obtain
> significance if one increased the N enough. So now we have an effect that
> works in only 40% of the population and it is deemed important and
> If you are dying, you might want to try it, but only an insipid idiot
> would call it reliable. Yet this is, apparently, your version of "modern
> But, of course, in most experiments, not even the researcher may know how
> many of his subjects actually showed an effect. All he or she may know
> (because that is all they are paying attention to) is that p<.01. And
> certainly the reader usually has no clue as to how many of the subjects
> actually "had" the "effect." In medical research, fortunately, there is
> pressure to pay close attention to the individual effects (BTW, Mat, if it
> is possible to judge an effect in an individual, what do you need
> for?). However, I argue, and occasionally some enlightened MD argues, that
> significance testing is dangerous. Sometimes you have nothing else but
> often you do.

RV: Aren't we all on the same page? You plot the data. You look for
subgroups and weird effects. You can test for some of these properties. If
everything looks like a homogeneous group then you can do some
inferential stats on them. In your example, the data would have two
peaks (at 0 and +x% effect) and would not be normally distributed. Anyone
testing this without caution is an idiot, but it does not make the
statistical tests wrong.
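
The 40%/60% example can be sketched as a simulation. Everything specific here is an assumption made for illustration and none of it comes from the thread: the group size, the size of the effect in responders (2 standard deviations), the seed, the function name, and the use of a large-sample normal approximation for the two-group comparison.

```python
import math
import random
import statistics


def simulate_trial(n_per_group=400, responder_fraction=0.4, effect=2.0,
                   seed=0):
    """Two-group trial where only a fraction of treated subjects respond.

    Returns the group-level result alongside the number of treated
    subjects who actually had an effect.
    """
    rng = random.Random(seed)
    n_responders = round(responder_fraction * n_per_group)
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    # Responders are shifted by `effect`; non-responders are not shifted.
    treated = [rng.gauss(effect if i < n_responders else 0.0, 1.0)
               for i in range(n_per_group)]
    diff = statistics.mean(treated) - statistics.mean(control)
    se = math.sqrt(statistics.variance(treated) / n_per_group
                   + statistics.variance(control) / n_per_group)
    z = diff / se
    # Normal approximation to the two-sided p-value (large samples).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {
        "mean_difference": diff,
        "p_value": p_value,
        "responders": n_responders,
        "non_responders": n_per_group - n_responders,
    }


result = simulate_trial()
print(result)
```

The group comparison reports a single mean difference and a small p-value, while the responder counts show that most treated subjects had no effect; a histogram of the treated group would show the two peaks described above.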

GS: I doubt we're all on the same page. Anyway, I am not attacking
statistical theory, I am attacking the unreasoned and nearly ubiquitous
reliance on significance testing, as well as misconceptions.

> GS: Usually the null hypothesis is, in the simplest case, that there is no
> difference between the control group (or control condition as in the
> t-test, which is the simplest form of repeated-measures ANOVA; hehehe) and
> the experimental group. So, yes, if you are doing ANYTHING it is likely to
> have SOME effect, and if you throw enough subjects at it, you will
> eventually reach a point where you "obtain statistical significance." This
> is, in fact, usually what happens in the sort of "science" you are talking
> about. BTW, in physics and many other sciences, what functions as the null
> hypothesis is, in fact, the scientist's own prediction! That is, the
> scientist does everything in his or her power to reject their own
> prediction, and when that does not occur they begin to assert the
> importance of their hypothesis. In contrast, "scientists" like you do
> everything in their power to reject the strawman notion that there is no
> effect which, as I have pointed out, is almost certain to be false.
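
The claim that enough subjects will push almost any nonzero effect past the significance threshold follows from how the test statistic scales with N. A minimal sketch, assuming two equal-sized groups with unit variance and an observed standardized difference exactly equal to the true effect size (the effect size of 0.05 and the function name are hypothetical):

```python
import math


def p_value_at_n(effect_size, n_per_group):
    """Two-sided p-value if the observed standardized difference between
    two groups of size n_per_group equals effect_size (unit variance,
    normal approximation)."""
    z = effect_size * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))


tiny_effect = 0.05  # one twentieth of a standard deviation
for n in (20, 200, 2000, 20000):
    print(f"n per group = {n:6d}   p = {p_value_at_n(tiny_effect, n):.2e}")
```

The effect size never changes in this loop; only N does. The p-value therefore tracks sample size as much as it tracks the effect, which is why it cannot stand in for a statement about how large or how reliable the effect is.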

RV: Come on Glenn, I don't think that too many papers are pointing out a 5%
difference even if it is significant at p<0.001.

GS: Oh really?

RV: Maybe you've had a bad
experience lately you want to share? Clinical significance involves the
idea that the effect is worth risking a change in therapy and so must be
a substantial improvement (not 5%) as well as a statistically significant

GS: In the basic laboratory in a lot of sciences, something gets published
if significance is obtained, and almost never gets published if no
significance is obtained, no matter how large the effect appears visually.
This is well known.

"Richard Vickery" <Richard.Vickery at> wrote in message

More information about the Neur-sci mailing list