The question looks simple, but your reflections around it show that it is not quite so simple.
Actually, p-values are a relatively late addition to the theory of statistics. Computing a p-value without a computer is very tedious, which is why, until recently, the only way to perform a statistical test was to use statistical tables, as I explain in this blog post. Because those tables were computed for fixed $\alpha$ levels (typically 0.05, 0.01 and 0.001), you could only perform a test at those levels.
Computers made those tables useless, but the logic of testing is still the same. You should:
1. Formulate a null hypothesis.
2. Formulate an alternative hypothesis.
3. Decide the maximum type I error (the probability of falsely rejecting the null hypothesis) you are ready to accept.
4. Design a rejection region. The probability that the test statistic falls in the rejection region given that the null hypothesis is true is your level $\alpha$. As @MånsT explains, this should be no larger than your acceptable type I error, and in many cases you will use asymptotic approximations (see the sketch after this list).
5. Carry out the random experiment, compute the test statistic and see whether it falls in the rejection region.
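Here is a minimal R sketch of steps 3 to 5 for the coin example; the 1000 flips and $\alpha = 0.05$ are illustrative assumptions, not part of your question.
# Step 3: maximum acceptable type I error.
alpha <- 0.05
n <- 1000
# Step 4: two-sided rejection region with (at most) alpha/2 in each tail.
lower <- qbinom(alpha / 2, size = n, prob = 0.5) - 1
upper <- qbinom(1 - alpha / 2, size = n, prob = 0.5)
# The achieved level; it is at most alpha because the binomial is discrete.
pbinom(lower, n, 0.5) + (1 - pbinom(upper, n, 0.5))
# Step 5: run the experiment and check the rejection region.
heads <- rbinom(1, size = n, prob = 0.5)
(heads <= lower) | (heads > upper)  # TRUE means reject the null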
In theory, there is a strict equivalence between the events "the statistic falls in the rejection region" and "the p-value is less than $\alpha$", which is why it is felt that you can report the p-value instead. In practice, it allows you to skip step 3 and to evaluate the type I error after the test is done.
To come back to your post, the statement of the null hypothesis is incorrect. The null hypothesis is that the probability of flipping a head is $1/2$ (the null hypothesis cannot pertain to the results of the random experiment).
If you repeat the experiment again and again with a threshold p-value of 0.05 then, when the null hypothesis is true, yes, you should have approximately 5% rejection. And if you set a p-value cut-off of 0.06, you should end up with roughly 6% rejection. More generally, for continuous tests, by definition of the p-value $p$,
$$\mathrm{Prob}(p < x) = x, \qquad 0 < x < 1,$$
which is only approximately true for discrete tests.
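For a continuous test you can check this identity directly by simulation, for instance with a one-sample $t$-test on normal data (the sample size of 30 below is an arbitrary choice):
set.seed(1)
# 10,000 t-tests of H0: mu = 0 on standard normal samples, so H0 is true.
p_cont <- replicate(10000, t.test(rnorm(30))$p.value)
mean(p_cont < 0.05)  # close to 0.05
mean(p_cont < 0.06)  # close to 0.06
The discrete case behaves differently, as the binomial example below shows.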
Here is some R code that I hope can clarify this a bit. The binomial test is relatively slow, so I do only 10,000 random experiments, in each of which I flip 1000 coins. I perform a binomial test on each and collect the 10,000 p-values.
set.seed(123)
# Generate 10,000 random experiments, each consisting of 1000 coin flips.
rexperiments <- rbinom(n=10000, size=1000, prob=0.5)
all_p_values <- rep(NA, 10000)
for (i in 1:10000) {
    # Exact two-sided test of H0: prob = 0.5 (the default null).
    all_p_values[i] <- binom.test(rexperiments[i], 1000)$p.value
}
# Plot the empirical cumulative distribution of the p-values.
plot(ecdf(all_p_values))
# How many are less than 0.05?
mean(all_p_values < 0.05)
# [1] 0.0425
# How many are less than 0.06?
mean(all_p_values < 0.06)
# [1] 0.0491
You can see that the proportions are not exact, because the number of simulated experiments is finite and the test is discrete, but there is still an increase of roughly 1% between the two.
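To separate the two sources of inexactness, here is a sketch that removes the simulation noise: since only 1001 outcomes are possible, you can enumerate every p-value and weight it by its probability under the null (this takes a little while, as binom.test is slow).
n <- 1000
# p-value of each possible outcome 0, 1, ..., n heads.
p_by_outcome <- sapply(0:n, function(k) binom.test(k, n)$p.value)
null_probs <- dbinom(0:n, size = n, prob = 0.5)
sum(null_probs[p_by_outcome < 0.05])  # the exact level estimated by 0.0425
sum(null_probs[p_by_outcome < 0.06])  # the exact level estimated by 0.0491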
Best Answer
Your understanding is mostly correct. Let $X$ be a random variable that follows the same distribution as your test statistic under the null hypothesis. The p-value is the probability that a randomly drawn $X$ is at least as extreme as the test statistic you actually computed. If that probability is very low, then that is good reason to believe that the null hypothesis does not hold.
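As a concrete sketch of this definition in R, sticking with the coin example (the 530 observed heads below is a made-up number):
n <- 1000
observed <- 530  # hypothetical observed number of heads
# P(X >= observed) where X ~ Binomial(n, 1/2) under the null:
1 - pbinom(observed - 1, size = n, prob = 0.5)
# The built-in one-sided test returns the same probability:
binom.test(observed, n, alternative = "greater")$p.value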
You just need to be careful about the difference in terminology between the p-value and the significance level. A significance level is a pre-specified cutoff, below which you reject the null hypothesis and above which you do not have enough evidence to reject it. The p-value itself is just a probability-valued function of the test statistic that gets smaller as the test statistic gets more extreme (for a right-tailed test, one minus the CDF of the null distribution, evaluated at the test statistic).
So the significance level does not, in general, determine the probability of rejecting the null hypothesis. It determines the largest p-value you are still willing to count as sufficient evidence against the null: when you set a significance level, you set an upper bound, below which you consider the observed test statistic too extreme to believe it was randomly drawn from the null distribution.
You might have been confused by someone talking about type I error rates and such. All that means is that, if you run the experiment many times, if the null hypothesis is true every time, and you set your significance level to $\alpha$, then you will reject the null hypothesis $\alpha \times 100\%$ of the time purely by random chance. Understanding this can help you set reasonable $\alpha$ levels if you do plan to do null hypothesis testing.