My reply to Mat w/o attachment

Glen M. Sizemore gmsizemore2 at
Sun Nov 17 13:46:09 EST 2002

Mat: Now the tone of this reply does seem again to be descending into
promotion of your area of work and again expose this open wound you
seem to have about behavioural science not being considered the
vanguard of discovery. Anyhow...

GS: 1.) The style of research that I (and many others; but, true, a
minority) practice is not limited to either the type of questions I ask, or
behavioral research in particular. 2.) The fact that the science of
behavior - as viewed by many behavior analysts - is not as widely recognized
as it should be is somewhat tragic (especially from the applied side) but
that is irrelevant. I was just describing how the method works in the
context of the scientific endeavor that I understand the best.

> 1.) a p-value is a conditional probability of the form p(A/B) where A is
> observation and B is the truth of the null hypothesis.
> 2.) you don't know if B is true or false.
> Conclusion: whatever a p-value is, it cannot be a quantitative assessment of
> the truth of B because the meaning of the p-value is dependent on B and you
> don't know what B is. Now attack the premises or the conclusion. I dare you.
> It is like telling a blind man that when the blue light flashes, there is a
> 0.1 probability he will get shocked if he doesn't press a lever. This is all
> but useless to him, in any quantitative sense, because he doesn't know if
> the blue light flashes.
> And no you can't say that it is really an assessment of the truth of the
> null hypothesis because that is claiming that p(A/B)=p(B/A). I remind you
> again of that below. No, no, don't thank me, Mat, just try to become a
> scientist.

Mat: You are accusing scientists of something they are not guilty of doing.
You claim that this assertion (p(A/B)=p(B/A)) is being made when it
is not. Every scientist knows the mathematical basis of p values, and
their conclusions are pragmatic interpretations of the result. If you
assume two dice are fair then the null hypothesis is there should be
no difference in the number of sixes thrown by each. In 100 trials,
if one die gave 17 sixes, while the other gave 82 you would end up
with a significant difference. Now you have two possible situations -
either the null hypothesis is true and you have just witnessed a
highly improbable series of events, or the null hypothesis is
incorrect and one of the dice is biased. No one is claiming that the
probability of getting the 82 sixes equates to the probability that
the null is true but that given a p<0.05, the situation in which we
find ourselves is sufficiently improbable for us to be pragmatic and
say the null is not true.
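
A minimal sketch of the dice example, assuming the comparison is framed as a one-sided exact test in the style of Fisher (that framing, the function name hypergeom_tail, and the variable names are illustrative assumptions, not something stated in the thread). It computes the chance, given that both dice share one unknown rate of sixes, that the 99 sixes observed in total would split at least as unevenly as 17 versus 82. In the notation used in this thread the result is p(A/B); on its own it is not p(B/A).

```python
from fractions import Fraction
from math import comb


def hypergeom_tail(k, n1, n2, total_successes):
    """P(X >= k), where X counts the successes that fall in group 1.

    Assumes the null hypothesis: both groups share the same success rate,
    so the total_successes are spread at random over the n1 + n2 trials.
    """
    denominator = comb(n1 + n2, total_successes)
    upper = min(n1, total_successes)
    numerator = sum(
        comb(n1, x) * comb(n2, total_successes - x) for x in range(k, upper + 1)
    )
    return Fraction(numerator, denominator)


# Numbers from the example: 100 throws per die, 17 sixes versus 82 sixes.
p_value = float(hypergeom_tail(82, 100, 100, 17 + 82))
print(f"one-sided p = {p_value:.3e}")
```

The printed value is vanishingly small. Turning it into a probability that the null is true would still need an extra input: how plausible a biased die was before the throws.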

GS: Which is inconsistent with nothing I have said. Let's leave aside the
issue of what actually does and does not occur in significance testing
(nobody ever comes up with their hypothesis after the data are collected,
right Mat?), which was not the original emphasis. I stated originally that:

1.) Tests of statistical significance do not provide a quantitative estimate
of the reliability of the result.

2.) Tests of statistical significance do not estimate the probability that
the results were due to chance.

3.) Tests of statistical significance usually do not answer a question to
which the answer is unknown.

To which you replied:

was there any reasoning behind these statements? any mathematics to
back it up or did he just 'say' and 'argue' the above? Unless he gave
a rigorous mathematical proof that the above are correct then it is
pointless arguing about them as statistics lies in the realm of
mathematics, obviously. You don't 'discuss' mathematics.

The first assumption is incorrect even before we begin to debate.
Statistics churns out numbers, so it is by definition quantitative.
What those numbers actually mean is another matter.

The second point - is the argument that the procedures are incorrect
(i.e. the algorithm) or that the underlying basic assumptions are
incorrect (e.g. normal distribution). If it is the former, then again
it's rubbish, if it's the latter then this argument is well known and he
presents nothing new.

GS: Remember that Mat? Here your argument certainly appears to be that what
I say was wrong (and do note your tone). But now, of course, you are arguing
that "scientists don't do this." This last statement you make specifically
in reference to my argument that, to say that a "p value represents the
probability that we will get data = or more extreme than the obtained sample
observation" by chance, is the same as saying p(B/A). Now, I'm sure you will
agree that:

1.) a p value is given by p(A/B), where A is data equal to or more extreme,
and B is the null hypothesis. The "/" is, of course, the conditional sign.

2.) to say that a p value represents the probability that you will get those
data by chance is to say that it quantitatively reflects the probability
that the null hypothesis is false, which is p(A/B)=p(B/A).
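
A small worked illustration of why p(A/B) and p(B/A) are different quantities, sketched with Bayes' rule. Every number here is a hypothetical input chosen for illustration, and the function name posterior_null is an assumption; nothing below comes from the thread or from real data. The point is only that p(B/A) cannot be computed from p(A/B) alone: it also needs the chance of the data when the null is false and a prior for the null.

```python
def posterior_null(p_data_given_null, p_data_given_alt, prior_null):
    """Bayes' rule: P(null | data) from P(data | null) and two extra inputs."""
    joint_null = p_data_given_null * prior_null
    joint_alt = p_data_given_alt * (1.0 - prior_null)
    return joint_null / (joint_null + joint_alt)


# Hypothetical inputs, chosen only for illustration.
P_DATA_GIVEN_NULL = 0.04  # plays the role of the p value, p(A/B)
P_DATA_GIVEN_ALT = 0.30   # assumed chance of such data if the null is false

for prior_null in (0.1, 0.5, 0.9):
    post = posterior_null(P_DATA_GIVEN_NULL, P_DATA_GIVEN_ALT, prior_null)
    print(f"prior P(null) = {prior_null:.1f} -> P(null | data) = {post:.3f}")
```

With the same p(A/B) of 0.04, the computed p(B/A) moves from well below 0.04 to above 0.5 as the assumed prior changes, so the two cannot be equated.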

Note that at first you argued, implicitly, that the argument presented above
was false, since you attempted to make an argument that what I was saying
was correct only when certain well-known violations occurred. Let me refresh
your memory, Mat: "The second point - is the argument that the procedures
are incorrect (i.e. the algorithm) or that the underlying basic assumptions
are incorrect (e.g. normal distribution). If it is the former, then again it's
rubbish, if it's the latter then this argument is well known and he presents
nothing new."

GS: But now, and I love this part, Mat. You want to say that "scientists
don't do this," but (and here's where the irony comes in, Mat) you apparently
made exactly that error when you emitted the quote above! One of two
things is true: either you didn't read what I wrote, or you didn't see that
the statement "Tests of statistical significance do not estimate the
probability that the results were due to chance" was correct. Right?

But let's return to the first point of which you made no mention here. You
recall that one, right Mat? That was: Tests of statistical significance do
not provide a quantitative estimate of the reliability of the result. Your
reply was: "The first assumption is incorrect even before we begin to
debate. Statistics churns out numbers, so it is by definition quantitative.
What those numbers actually mean is another matter." Now what exactly does
this mean? Giving you the benefit of the doubt, we can say that you simply
missed the point. That is, you confused my actual statement with the
statement: "Tests of statistical significance are not quantitative." But
that, of course, was not what I said. One can only surmise that you were
misreading again, so it behooves me to make sure you understand the point. A
p value has quantitative meaning only if B (as defined above) is true. This
is true by the definition of a p value. To reject the null and turn around
and assert that the number has any quantitative meaning is quite absurd. But
remember, the main point was that it does not reflect (at least in any known
way) the probability that the results would be obtained again. Now, the
question is, are you taking umbrage with this assertion?

> Mat: doh! you are doing the very same as the people you chastise! by
> repeating the experiments you are increasing your n, such that if there is a
> true difference it should become apparent.
> GS: Nonsense. What I am doing, and what others like me do is directly
> demonstrating the reliability. That's why it is not unheard of to publish
> data collected and analyzed "individual-subject style" with 3 subjects.
> such data are, as I explained, generally proven to be reliable through
> direct and systematic replication. What "thinkers" like you do is increase
> the N because doing so will almost always result in differences even if the
> "effect" is virtually nonexistent (see below).
Mat: But in clinical sciences the situation is simply not the same, as you
must begin to appreciate. The differences between drugs are simply not as
extreme as the evidence you collect. If you use two modern
drugs on 3 patients each you will in no way get an accurate picture of
the actual efficacy.

GS: You have to back this up. I agree that three is a rather low number in a
circumstance like a clinical trial, but it is not clear that some version of
the method I am suggesting cannot be used (more on that below).

Mat: The reasons drug trials have to be done on a
population basis are many and varied but include averaging out factors
such as age, severity of disease, comorbidity etc. - all factors which
can be controlled in the laboratory.

GS: This doesn't really matter if one is in a position to determine effects
in individual subjects. If one can do this, one can simply state the number
looked at, the number of improvements, well....need I go on?

Mat: Also if a drug is only marginally better, then this will only bear out
over a large sample -
it doesn't make the difference any the less important.

GS: This is very close to saying that the size of the effect (and relatedly,
the number of subjects actually showing an effect from an individual
standpoint - and this, of course, is frequently not published or not
meaningful because subjects were either "exposed or not exposed or exposed
to placebo") doesn't matter, all that matters is that you have the p value
to tell you that you have obtained truth.

Mat: If in your experiments you used a dose 'x' of cocaine and in another
set used a dose '0.95x' do you think you could detect which were which given
three animals treated with each? There should be a difference, but it might
take you ten, twenty, a hundred animals to get enough data to
clearly see a trend.

GS: But if you have an idea about the shapes of the dose-effect functions in
individual subjects, the sort of data you're talking about becomes somewhat
trivial. And in medicine, ultimately what matters is the characteristics of
individual-subject functions, as well as a particular subject's function.
Anyway, it is an interesting point - if one knew the general characteristics
of the individual-subject functions, one would know that there might be
places on the dose-effect function where a small dose difference would be
detectable (0.95x is a little on the absurd side though, Mat, and that is
why doses are frequently increased according to a logarithmic function) and
others where they probably wouldn't. But we know, also, from these
functions, that it should be possible to detect very, very, small
differences, given that we add enough subjects. Sound familiar, Mat? The
situation is quite apropos - we already know that a difference is detectable
and obtaining significance by adding subjects is completely trivial.
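
A rough sketch of the claim that a very small difference becomes significant once enough subjects are added. It assumes a two-group z test with unit standard deviation in both groups, and it treats the observed mean difference as exactly equal to the assumed true difference, so sampling noise is ignored. The effect size of 0.05 SD and the function name two_sided_p are hypothetical choices made for illustration.

```python
from math import erfc, sqrt


def two_sided_p(effect_size, n_per_group):
    """Two-sided p for a two-group z test with unit SD in both groups.

    Simplifying assumption: the observed mean difference equals the true
    standardized difference (effect_size) exactly, so there is no sampling noise.
    """
    z = effect_size * sqrt(n_per_group / 2)
    return erfc(z / sqrt(2))  # equals 2 * (1 - Phi(z))


EFFECT_SIZE = 0.05  # hypothetical: a difference of 1/20 of a standard deviation

for n in (3, 30, 300, 3_000, 30_000):
    print(f"n per group = {n:>6}: p = {two_sided_p(EFFECT_SIZE, n):.4f}")
```

Under these assumptions the size of the effect never changes; only N does, and the p value falls from near 1 toward 0 as N grows.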


> Mat: No, changing the assertion is not allowed as any decent statistician
> will tell you. the p value is categorically not a probability that any
> hypothesis, null or otherwise, is true. You don't actually understand this
> do you?
> GS: I didn't say the assertion was changed. Can you not read? What I said
> was saying that a p-value expresses the probability that the data were due
> to chance is tantamount to reversing the conditionality. Again, a p-value is
> a conditional probability of the form p(A/B) where A is the observation, and
> B is the truth of the null hypothesis. When one asserts that it really is
> the probability that the null hypothesis is true given the observation, one
> is asserting that p(A/B) is the same as p(B/A). And this, my arrogant,
> ignorant "friend," is quite simply incorrect.

Mat: But this assertion is not made.

GS: I'm sure you see why this is funny, now. Even without your own lapse of
judgement, I'm sure, now that "your consciousness has been raised," you will
notice the many times that decisions are made based on one or more of the
fallacies, and their subsidiaries, outlined above.

Mat: No-one is saying that the equivalence
you make is true, only that having found yourself in an unlikely
situation (i.e. having a low p value) the opposite to the null
hypothesis intuitively is likely to be the case. In childhood
leukaemia, drug regimens have evolved through clinical trials and the
cure rate is at an all-time high. The trials undoubtedly relied on p
values. You claim p values are useless but you ignore the weight of
evidence in the improvements in medical treatment to show their

GS: This is naive. Many things contributed to the cure rate, including
sciences that were developed without significance testing. But there are
other things wrong. Some amount of success is not the same as showing that
the methodology is superior to other proposed tactics. Ok, so significance
testing is not completely worthless, and doesn't completely subvert the
scientific process.

Mat: You state that increasing the N to obtain significance is a perversion
of science - do you therefore claim that when significance is reached
it is an artefact and that there is no underlying difference? If a
difference is slight, large numbers will be needed to detect it, and
then its clinical relevance is actually questioned. The 'number
needed to treat' is used as a measure of how useful a new (and more
expensive) drug is likely to be. You claim that the majority of
scientists are in error but you bury your head to the facts of the

GS: Again, the fact that progress has been made does not show 1.) that the
mistaken notions that I emphasized from the beginning do not occur, and it
certainly has nothing to do with the truth of the 3 main assertions. The
case of detecting differences between dose x and dose .95x is an example of
an experiment that does not need to be done because we already know that
eventually we will be able to detect the difference - and we know this if we
have direct evidence that the effect is dose-dependent in individual
subjects. Furthermore, if we know the effects are dose-dependent, and we
don't obtain a difference we don't know what to say because it is likely that
some subjects were impacted in one direction, and others in another
direction, and these cancel out. Does this mean that there is no effect?
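
A toy illustration of the cancellation described above, using made-up change scores (the six values and the threshold of 1.0 are hypothetical, chosen so that the arithmetic is easy to follow). Every subject changes by a clear amount, yet the group mean is zero.

```python
from statistics import mean

# Hypothetical change scores (treated minus baseline) for six subjects.
changes = [+2.0, -2.0, +1.5, -1.5, +3.0, -3.0]
THRESHOLD = 1.0  # assumed cut-off for calling an individual change an effect

group_mean = mean(changes)
n_affected = sum(1 for c in changes if abs(c) >= THRESHOLD)

print(f"group mean change = {group_mean:.2f}")
print(f"subjects with |change| >= {THRESHOLD}: {n_affected} of {len(changes)}")
```

A group-level test on these scores would report no difference, while a subject-by-subject count would report a change in all six.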

Mat: Increasing your N does not actually always give you a significant
result (because, of course, there might be nothing going on). Take
for example the International Stroke Trial, comprising a population of
30,000. No significant difference was found between those treated with
Aspirin for acute stroke and those treated as per current guidelines.

GS: So your argument is now that because the null hypothesis is occasionally
not rejected my argument is false? Plus....why should I not seek still more
subjects? Are you sure a difference would not be found in 300,000 (and, yes,
I know about power calculations)? The fact is that one may turn an
experiment from a success to a failure simply by adding enough subjects, and
this may be done without increasing one's experimental control (where it is
relevant, in primarily experimental sciences) or one's understanding of the
subject matter (where direct experimental control must be forgone). And
since papers that do not have small p values in the Results section do not
get published the pressure is extreme to do just that. And I am not the only
one who suspects that it is widespread. Incidentally, a related issue is
that a paper that actually has suffered a Type I error is likely to get
published, but the other 99 that didn't saw only the inside of a wastepaper
basket. In sciences where people are not impressed with your effect until it
is demonstrated in all or most of a few to several subjects, one must
improve one's control over the subject matter in order to show the
reliability of the result, and one has simultaneously provided data
concerning the generality of the effect. In the monkey example I described,
one may say that the effects of delayed reinforcement show good species
generality, but the effects can depend on the characteristics of the IRT
distribution (which, of course, we know how to control from other
experiments that demonstrated the reliability of the effects of IRT
contingencies on the distribution).
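
The remark earlier in this paragraph about published Type I errors can be put into numbers with a short calculation. This is a sketch under stated assumptions: 100 independent experiments, every one of them testing a null hypothesis that is actually true, and a threshold of 0.05. Both figures are hypothetical and are not taken from the thread.

```python
ALPHA = 0.05          # conventional significance threshold (assumed)
N_EXPERIMENTS = 100   # hypothetical number of independent studies of a true null

# Expected count of studies that reach significance purely by chance.
expected_false_positives = ALPHA * N_EXPERIMENTS

# Chance that at least one of the studies reaches significance.
p_at_least_one = 1.0 - (1.0 - ALPHA) ** N_EXPERIMENTS

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {p_at_least_one:.3f}")
```

If only the significant studies reach print, the published record under these assumptions consists entirely of errors.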

> Mat: It does not tell you what is true or not true of the whole
> The conclusions drawn are tentative inferences based on the stats. The
> arbitrary limit is set at 95% and above this we claim that we have good
> enough evidence to act as though the null hypothesis is not true of the
> general population - it still may be true, we will never know. All we can do
> is act according to the best available evidence. It's modern science, and the
> approach has improved healthcare dramatically.
> GS: You are seriously confused.

Mat: Or maybe you don't actually know what goes on in clinical science.
Maybe you should just get on and do your really useful research seeing
how rats react to cocaine while the clinical scientists get on with
the business of actually improving treatments for you and the rest of

Glen: now basic experimentation with non-human animals is
worthless, right? Besides, it doesn't get at any of the points I made above.
But, anyway, I do bow down before you oh "He who saves lives!"

> GS: Think about this: if you have a drug that produces large effects in 40%
> of the sample, and no effect in the other 60%, one could obtain
> significance if one increased the N enough. So now we have an effect that
> works in only 40% of the population and it is deemed important and
> If you are dying, you might want to try it, but only an insipid idiot would
> call it reliable. Yet this is, apparently, your version of "modern
> But, of course, in most experiments, not even the researcher may know how
> many of his subjects actually showed an effect. All he or she may know
> (because that is all they are paying attention to) is that p<.01. And
> certainly the reader usually has no clue as to how many of the subjects
> actually "had" the "effect." In medical research, fortunately, there is some
> pressure to pay close attention to the individual effects (BTW, Mat, if it
> is possible to judge an effect in an individual, what do you need
> for?) . However, I argue, and occasionally some enlightened MD argues,
> significance testing is dangerous. Sometimes you have nothing else but
> often you do.
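
A sketch of the 40%/60% scenario in the quoted passage, under simplifying assumptions: a two-group z test, unit standard deviation for the untreated outcome, and observed means taken at their expected values. The effect sizes, sample sizes, and the function names two_sided_p and mixture are hypothetical. It compares a drug that shifts 40% of subjects by 1.0 SD with a drug that shifts every subject by 0.4 SD.

```python
from math import erfc, sqrt


def two_sided_p(mean_shift, var_treated, n_per_group, var_control=1.0):
    """Two-sided z-test p value, taking the observed means at their expected values."""
    standard_error = sqrt((var_treated + var_control) / n_per_group)
    return erfc((mean_shift / standard_error) / sqrt(2))


def mixture(fraction_responding, effect_in_responders):
    """Mean shift and variance of a treated group; baseline noise has unit variance."""
    f = fraction_responding
    mean_shift = f * effect_in_responders
    variance = 1.0 + f * (1.0 - f) * effect_in_responders**2
    return mean_shift, variance


scenarios = {
    "40% respond, effect 1.0 SD": mixture(0.4, 1.0),
    "100% respond, effect 0.4 SD": mixture(1.0, 0.4),
}
for label, (shift, var) in scenarios.items():
    for n in (10, 100):
        print(f"{label}, n = {n:>3}: p = {two_sided_p(shift, var, n):.4f}")
```

Both scenarios have the same mean shift and, under these assumptions, both fall below 0.01 at 100 subjects per group, so the p value alone does not reveal how many subjects responded.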

Mat: And you think this is actually how things are done? that such
markedly different efficacies between patient subgroups are not
identified? Who are you trying to kid?

GS: I am certain that it is done, perhaps more in basic research, and here
it is rampant, especially from the standpoint of those who don't consider an
effect reliable unless it is demonstrated repeatedly within and across
subjects. And let me reiterate, in the vast majority of basic papers that
you read in a bunch of different sciences, there is no indication of how
many subjects "actually showed an effect" or even any way to determine it
(you have a chance, at least, when the "design" is repeated measures). And
yes, if the data were graphically portrayed in a variety of ways, showing
distributions, confidence intervals, ranges, even raw data, one may obtain
an inkling of "what really happened" but the reliance on the p value as the
near sole arbiter (and this is the case if the publication of a paper
depends, in practice, mostly on obtaining significance) has all but put an
end to such direct, exploratory data analysis. I spend large amounts of my
time trying to piece together, usually without success, what actually went
on in experiments where the resulting papers are filled with F-ratios and p
values, but no way to tell how many subjects showed any effect at all (if
this is even definable - as it is NOT in experiments where different groups
of subjects have been exposed to a single dose). Where you can obtain such
data, demonstrating effects repeatedly in 4 of 4 subjects goes a longer way in
establishing the reliability (if not generality, when the subjects are of
different "kinds") than the report of group means and p values with 50

Mat: It is certainly the case that different treatments can be
significantly different statistically and yet the difference be of
questionable clinical value. But that is not a problem of the stats,
merely of their interpretation.

GS: As are all of the problems I am talking about. And we haven't even
talked much about the notion that averaging across subjects may produce
numbers that have nothing to do with the individual-subject phenomena.

> Mat: When was it first demonstrated? Proof of this comes from where? Prove
> to me that there is any sort of difference between anything you can think
> of. Observed all of them?
> GS: Science, in general, demonstrates that "something nonrandom is almost
> always going on." Does that answer your question?

Mat: err, surprisingly not.

GS: It surprises me given the open-minded way you responded. And I must,
once again, express my distaste for your tactics of argumentation in this
forum. But more to the issue (that is leaving aside your arrogance) I assert
again - to use but one example - that if one does experiments with drugs
that are known to be active with respect to some measure, one will be likely
to find some difference if the number of subjects is increased sufficiently.

Of course, I am as vehement about my views as you are about yours, but then
I am familiar with the points made by the growing minority of scientists
that see trouble with significance testing, as well as the "standard view,"
and have had to demonstrate understanding of both because of academic
requirements. But you have been merely indoctrinated. If I have been
indoctrinated, at least it was by both sides!

Anyway, are you sure that there is no difference between aspirin and the
formerly standard protocols? At N=300,000; 3,000,000? In any event, if you
are testing different drugs, there is a very, very, good chance that you
will eventually obtain differences. Perhaps not all the time but, as you
know, even subtle differences in the pharmacological actions (say a slight
difference in receptor sub-type selectivity) are very likely to translate
into group differences at some N. It may take a large number of subjects,
but I submit that it is very, very, likely that different drugs will produce
differences if enough subjects are used. In addition, the rejection of the
null hypothesis is ass-backwards from how the more powerful sciences
evolved. It is not clear what meaning we should ascribe to the theories
behind sciences where one's theory is said to be supported by the rejection
of the strawman-nil-null-hypothesis. In other sciences that have developed
enough for the hypothetico-deductive method to be worthwhile, the scientists
seek to reject their own hypothesis. They do everything in their power to
reject it. In contrast, we have the standardized rejection of the strawman
hypothesis. And to claim that the alternative is supported given the
rejection of the null is ludicrous. Even if there are only two competing
theories developed (and of course the "amount" that H1 is "strengthened"
presumably depends on the number of competing theories with which the
obtained "result" is not incompatible) there is no reason to suspect that
there are not many - if not infinitely many - possible alternatives that we
(in our self-professed ignorance) have not come up with. If science is about
the unknown, it is ironic that what we currently assert is given so much
stock. We're in the dark, oh! but there are three theories that we have...
and we obtain a finding incompatible with one of them (since it predicts "no
difference"). Now what does this really say about the likelihood of the
others if, in our ignorance, we have not come up with 12 others? Now, say
all three theories make different quantitative predictions and only one of
them is close, and it is very close, given the range of variability we have
seen. This is hypothesis testing; what is done in the name of inferential
statistics in (broadly speaking) much of the biological sciences is a
complete perversion.

"mat" <mats_trash at> wrote in message
