# Some myths concerning statistical hypothesis testing

mat mats_trash at hotmail.com
Thu Nov 7 15:28:21 EST 2002

```> The first is not an assumption,
> it is a fact. A p-value expresses a conditional probability. That is, a
> p-value expresses the probability of obtaining the observation in question
> GIVEN THAT THE NULL HYPOTHESIS IS TRUE. Since it is not known whether or not
> the null hypothesis is true (at least ostensibly, but see below), the notion
> that a small p-value means the finding is highly likely to be replicated is
> clearly false.

Rubbish, it's not just wrong - it doesn't even make sense.  Let's say we
have two groups of patients with the same disease; one is given
treatment A, the other B.  The initial assumption is that there is no
difference in the efficacy of the treatments (null hypothesis).  We
set up a trial to see if either patient group does any better.  As it
turns out, group B does better than group A.  We apply a suitable test
and get p<0.05.  Therefore there is only a 5% chance that what was found
true for the selected patients is not generally true of all patients with
the disease, i.e. that in fact, for the population as a whole, A=B.  Now
you are saying that you cannot know the null hypothesis is true
beforehand - but that's the whole point of the test!  Where does
arguing that you can't know whether the null is right or wrong
beforehand get you?

> The only way to demonstrate the reliability of data is to
> replicate the finding.

Where in standard statistics is it claimed otherwise?  The p value is
only a probability.

> Marc does not say what follows in his paper, but this
> misconception has produced a state of affairs in which a great deal of
> importance is attached to findings before it is clear that the finding is
> reliable. The result is that there is, all things being equal, a great deal
> of discrepant results in the various scientific literatures that rely on
> statistical significance testing. In contrast, for sciences in which the
> reliability is demonstrated in each subject (usually repeatedly), or
> "subject" if the preparation is not a whole animal, there is far less
> failure to replicate (this is because such data are published only when
> there have been numerous demonstrations of reliability within and across
> subjects). For an example of how this is done, you may examine my paper: The
> Effects of Acutely Administered Cocaine on Responding Maintained by a
> Progressive-ratio Schedule of Food Presentation, which is in press in
> Behavioural Pharmacology. Or, you may examine virtually any paper in the
> Journal of the Experimental Analysis of Behavior. Or you may obtain a copy
> of Sidman's Tactics of Scientific Research, or even Claude Bernard's classic
> book.
>

Doh! You are doing the very same as the people you chastise!  By
repeating the experiments you are increasing your n, such that if
there is a true difference it should become apparent.  Just because
you don't apply a t-test and get a p value doesn't mean you aren't
doing the same thing.  If six animals respond to cocaine and six don't
respond to placebo, the implicit message is that you'll get a low p
value.  When datasets of tens of thousands are involved you need tools
to summarise.  What would you say if mice 1, 3 and 5 responded to
cocaine and no others did?  Would you say cocaine does have an effect?
How do you proceed to argue your case and produce a conclusive result?

> Mat: The second point - is the argument that the procedures are incorrect
> (i.e. the algorithm) or that the underlying basic assumptions are
> incorrect (e.g. normal distribution). If it is the former, then again
> its rubbish, if its the latter then this argument is well known and he
> presents nothing new.
>
> GS: Wrong. Remember that a p-value represents the probability that one will
> observe certain data given that the null hypothesis is true. If one asserts
> that the p-value is really the probability that the null hypothesis is true
> given the data (which is the same thing as saying it represents the
> probability that the observed data are "due to chance") is to "reverse the
> conditionality." As Marc says, this is tantamount to saying that the
> probability of rain given that it is cloudy is the same as the probability
> that it is cloudy given that it is raining. Think about it when your blood
> pressure returns to normal.

No, changing the assertion is not allowed, as any decent statistician
will tell you.  The p value is categorically not a probability that
any hypothesis, null or otherwise, is true.  You don't actually
understand this, do you?  In the population under investigation, either
the null or the proposed hypothesis is true.  What the stat test tells
you is the likelihood of you again finding a significant difference if
you took another sample of the population and did the trial again.  It
does not tell you what is true or not true of the whole population.
The conclusions drawn are tentative inferences based on the stats.
The arbitrary limit is set at 95% and above this we claim that we have
good enough evidence to act as though the null hypothesis is not true
of the general population - it still may be true, we will never know.
All we can do is act according to the best available evidence.  It's
modern science, and the approach has improved healthcare dramatically.

>
> GS: Now it is my turn to use the term "rubbish" (here in the States, we
> usually call it "garbage," but BS is probably more appropriate). If you
> "obtain significance" you write a paper and submit it. If you do not, you
> throw the data in the garbage (sounds pretty damn "categorical" to me), or
> you just "increase the N" until you have found your "truth" (the fact that
> all you have to do usually to reject the null hypothesis is simply add more
> subjects should tell you something).

Just increase your n?! This is laughable.  If you had been taught any
stats you would understand that if there is only a very slight
difference in the efficacy of two drugs, say, then a large sample size
will be needed so that the difference becomes apparent.  Let's say, for
example, that drug A makes 50% of people better, while drug B makes 52%
of people better.  Would you expect that if you chose ten people on
each drug you'd observe the difference?  What about 100?  Would you
be confident in being treated by a doctor who based his treatment on
the observation of the 20 other people he'd seen with your disease in
his career?

> You know this is true. But, in any
> event, you are not on the right track. The point is that the strawman null
> hypothesis is almost always not true. Marc, quoting Kraemer (ref. on
> request), writes, "something nonrandom is almost always going on, and it
> seems a trivial exercise to redemonstrate that fact."

When was it first demonstrated?  Proof of this comes from where?  Prove
to me that there is any sort of difference between anything you can
think of.  Have you observed all of them?

> At the end of this
> section Branch concludes, "Perhaps it is not so bad that significance tests
> do not estimate the truth of the null hypothesis, because we already know
> that it is false.

Any null hypothesis is false?  Prozac is no good for heart attacks,
oops false!

Drug A is no better than drug B - false!?  Which one is better?
Ah... well, ahem.

If the origin of your (and whoever else's) dissatisfaction with p values
and the like lies in the fact that trials etc. often contradict each
other, even though they all publish a 'significant' p value, then aim
your contempt at the design of the trial, not the stats.

```
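
As a rough illustration of the two numerical points raised in the post (six responders versus none, and telling 50% from 52%), here is a minimal Python sketch. It is not from the original thread. The function names, the 5% two-sided level and the 80% power target are assumptions chosen for the example, and the sample-size formula is the usual normal approximation for comparing two proportions, not necessarily what either poster had in mind.

```python
from math import ceil, comb, sqrt
from statistics import NormalDist


def p_all_vs_none(n_treated: int, n_control: int) -> float:
    """One-sided Fisher exact p-value for the most extreme 2x2 table:
    every treated subject responds and no control subject does.

    Under the null, the n_treated responders are a random subset of all
    subjects, so only one of comb(total, n_treated) subsets is this extreme.
    """
    return 1 / comb(n_treated + n_control, n_treated)


def n_per_group(p1: float, p2: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate subjects needed per arm to detect p1 vs p2
    with a two-sided z-test (normal approximation; assumed settings)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    p_bar = (p1 + p2) / 2               # pooled proportion under the null
    root = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
            + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil(root ** 2 / (p1 - p2) ** 2)


if __name__ == "__main__":
    # Six animals respond to the drug, six do not respond to placebo.
    print(p_all_vs_none(6, 6))
    # Drug A helps 50% of patients, drug B helps 52%.
    print(n_per_group(0.50, 0.52))
```

Under these assumptions the six-versus-six case gives a p-value of 1/924 (about 0.001), and the 50% versus 52% comparison needs on the order of ten thousand patients per arm, which is why ten or a hundred patients per drug would not be expected to show the difference.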