The question looks simple, but your reflections on it show that it is not that simple.
Actually, p-values are a relatively late addition to the theory of statistics. Computing a p-value without a computer is very tedious; this is why, until recently, the only way to perform a statistical test was to use precomputed statistical tables, as I explain in this blog post. Because those tables were computed for fixed $\alpha$ levels (typically 0.05, 0.01 and 0.001), you could only perform a test at those levels.
Computers made those tables useless, but the logic of testing is still the same. You should:
- Formulate a null hypothesis.
- Formulate an alternative hypothesis.
- Decide on the maximum type I error (the probability of falsely rejecting the null hypothesis) you are ready to accept.
- Design a rejection region. The probability that the test statistic falls in the rejection region given that the null hypothesis is true is your level $\alpha$. As @MånsT explains, this should be no greater than your acceptable type I error, and in many cases you will use asymptotic approximations to construct it.
- Carry out the random experiment, compute the test statistic and see whether it falls in the rejection region.
In theory, there is a strict equivalence between the events "the statistic falls in the rejection region" and "the p-value is less than $\alpha$", which is why it is felt that you can report the p-value instead. In practice, it allows you to skip step 3. and evaluate the type I error after the test is done.
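As a minimal sketch of these steps in R for the coin-flipping example (the sample size of 1,000 flips and $\alpha = 0.05$ are arbitrary choices of mine):

# H0: the probability of heads is 1/2; H1: it is not.
n <- 1000
alpha <- 0.05
# Two-sided rejection region from the binomial quantiles under H0,
# chosen so that its level is no greater than alpha.
lower <- qbinom(alpha/2, size=n, prob=0.5) - 1
upper <- qbinom(1 - alpha/2, size=n, prob=0.5) + 1
# Carry out the random experiment and check the rejection region.
heads <- rbinom(1, size=n, prob=0.5)
(heads <= lower) || (heads >= upper)
# Equivalently (up to the discreteness of the binomial), compare
# the p-value to alpha: the two decisions agree.
binom.test(heads, n)$p.value < alpha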
To come back to your post, the statement of the null hypothesis is incorrect. The null hypothesis is that the probability of flipping a head is $1/2$ (the null hypothesis cannot pertain to the results of the random experiment).
If you repeat the experiment again and again with a p-value threshold of 0.05 then, yes, you should see approximately 5% rejections. And if you set the cut-off at 0.06, you should end up with roughly 6% rejections. More generally, for continuous tests, by definition of the p-value $p$,
$$ \mathrm{Prob}(p < x) = x, \quad 0 < x < 1, $$
which is only approximately true for discrete tests.
Here is some R code that I hope can clarify this a bit. The binomial test is relatively slow, so I run only 10,000 random experiments, in each of which I flip 1,000 coins. I perform a binomial test on each and collect the 10,000 p-values.
set.seed(123)
# Generate 10,000 random experiments of each 1000 coin flipping
rexperiments <- rbinom(n=10000, size=1000, prob=0.5)
all_p_values <- rep(NA, 10000)
for (i in 1:10000) {
  all_p_values[i] <- binom.test(rexperiments[i], 1000)$p.value
}
# Plot the cumulative density of p-values.
plot(ecdf(all_p_values))
# How many are less than 0.05?
mean(all_p_values < 0.05)
# [1] 0.0425
# How many are less than 0.06?
mean(all_p_values < 0.06)
# [1] 0.0491
You can see that the proportions are not exact, because the sample size is not infinite and the test is discrete, but there is still an increase of roughly one percentage point between the two.
My personal appraisal of his arguments:
- Here he talks about using $p$ as evidence for the Null, whereas his thesis is that $p$ can't be used as evidence against the Null. So, I think this argument is largely irrelevant.
- I think this is a misunderstanding. Fisherian $p$ testing is strongly rooted in Popper's Critical Rationalism, which holds that you cannot support a theory but only criticize it. So in that sense there is only a single hypothesis (the Null), and you simply check whether your data are in accordance with it.
- I disagree here. It depends on the test statistic, but $p$ is usually a transformation of an effect size that speaks against the Null. So the larger the effect, the lower the $p$ value, all other things being equal. Of course, for different data sets or hypotheses this is no longer valid.
- I am not sure I completely understand this statement, but from what I can gather this is less a problem of $p$ than of people using it wrongly. $p$ was intended to have the long-run frequency interpretation, and that is a feature, not a bug. But you can't blame $p$ for people taking a single $p$ value as proof of their hypothesis, or for people publishing only $p<.05$.
His suggestion of using the likelihood ratio as a measure of evidence is in my opinion a good one (though the idea of a Bayes factor is more general), but the context in which he brings it up is a bit peculiar: first, he leaves the grounds of Fisherian testing, where there is no alternative hypothesis from which to calculate the likelihood ratio. But $p$ as evidence against the Null is Fisherian; hence he confounds Fisher and Neyman-Pearson. Second, most test statistics that we use are (functions of) the likelihood ratio, and in that case $p$ is a transformation of the likelihood ratio. As Cosma Shalizi puts it:
among all tests of a given size $s$, the one with the smallest miss probability, or highest power, has the form "say 'signal' if $q(x)/p(x) > t(s)$, otherwise say 'noise'," and that the threshold $t$ varies inversely with $s$. The quantity $q(x)/p(x)$ is the likelihood ratio; the Neyman-Pearson lemma says that to maximize power, we should say "signal" if it is sufficiently more likely than noise.
Here $q(x)$ is the density under state "signal" and $p(x)$ the density under state "noise". The measure for "sufficiently likely" would here be $P(q(X)/p(X) > t_{obs} \mid H_0)$, which is $p$. Note that in correct Neyman-Pearson testing, $t_{obs}$ is substituted by a fixed $t(s)$ such that $P(q(X)/p(X) > t(s) \mid H_0)=\alpha$.
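To make this concrete, here is a minimal R sketch with two simple hypotheses (the "signal" probability 0.55 and the observed count are toy values of mine): because the likelihood ratio is increasing in the number of heads here, $P(q(X)/p(X) > t_{obs} \mid H_0)$ reduces to an ordinary one-sided binomial $p$-value.

# "noise" H0: X ~ Binomial(n, 0.5); "signal" H1: X ~ Binomial(n, 0.55).
n <- 1000
x_obs <- 530
# Likelihood ratio q(x)/p(x), increasing in x for these two hypotheses.
lr <- function(x) dbinom(x, n, 0.55) / dbinom(x, n, 0.5)
t_obs <- lr(x_obs)
# P(q(X)/p(X) > t_obs | H0), computed directly from the definition...
sum(dbinom(0:n, n, 0.5)[lr(0:n) > t_obs])
# ...equals the one-sided tail probability P(X > x_obs | H0).
pbinom(x_obs, n, 0.5, lower.tail=FALSE)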
Perhaps. What convinces you of this?
Not at all. How did you go from 'continuous measure of evidence against' to 'there is no difference'?
In particular, Fisher would not make the mistake of thinking that failure to reject makes $H_0$ actually true.
No, for two reasons.
(i) If $p>\alpha$, you won't reject, so you can't commit a type I error at all.
(ii) You don't even have an $\alpha$ probability of making a type I error, since the type I error rate is a conditional probability, and in real situations the corresponding joint (unconditional) probability is close to zero (that is, point null hypotheses are almost never exactly true, and you can only make a type I error when they are exactly true).
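Spelled out (my own rendering of the argument, not part of the original answer): the unconditional probability of a type I error factors as
$$ P(\text{reject and } H_0 \text{ true}) = P(\text{reject} \mid H_0 \text{ true}) \, P(H_0 \text{ true}) \le \alpha \, P(H_0 \text{ true}), $$
which is close to zero whenever $P(H_0 \text{ true})$ is, as is the case when point nulls are essentially never exactly true.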
[ ... I suppose that I'm arguably acting more as a Bayesian there]