The question looks simple, but your reflection on it shows that it is not quite so simple.
Actually, p-values are a relatively late addition to the theory of statistics. Computing a p-value without a computer is very tedious; this is why, until recently, the only practical way to perform a statistical test was to use statistical tables, as I explain in this blog post. Because those tables were computed for fixed $\alpha$ levels (typically 0.05, 0.01 and 0.001), you could only perform a test at those levels.
Computers made those tables useless, but the logic of testing is still the same. You should:
- Formulate a null hypothesis.
- Formulate an alternative hypothesis.
- Decide the maximum type I error (the probability of falsely rejecting the null hypothesis) you are ready to accept.
- Design a rejection region. The probability that the test statistic falls in the rejection region, given that the null hypothesis is true, is your level $\alpha$. As @MånsT explains, this should be no greater than your acceptable type I error, and in many cases it is computed using asymptotic approximations.
- Carry out the random experiment, compute the test statistic and see whether it falls in the rejection region.
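To make this concrete, here is a minimal R sketch of the steps above for the coin example (my own illustration, with a hypothetical run of 1000 flips); the rejection region is built from binomial quantiles, so its attained level is at most $\alpha$ because the statistic is discrete.

set.seed(1)
# H0: the probability of heads is 1/2; H1: it is not.
alpha <- 0.05                               # maximum acceptable type I error
n     <- 1000                               # number of coin flips
# Rejection region: numbers of heads far from n/2.
lower <- qbinom(alpha / 2, n, 0.5) - 1      # reject if heads <= lower
upper <- qbinom(1 - alpha / 2, n, 0.5) + 1  # reject if heads >= upper
# Attained level (at most alpha because the statistic is discrete)
pbinom(lower, n, 0.5) + 1 - pbinom(upper - 1, n, 0.5)
# Carry out the experiment and check whether the statistic falls in the region
heads <- rbinom(1, n, 0.5)
(heads <= lower) | (heads >= upper)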
In theory, there is a strict equivalence between the events "the statistic falls in the rejection region" and "the p-value is less than $\alpha$", which is why it is felt that you can report the p-value instead. In practice, it allows you to skip step 3. and evaluate the type I error after the test is done.
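For a continuous test this equivalence is easy to check directly. Here is a minimal sketch (my own example, assuming a two-sided one-sample t-test at level $\alpha = 0.05$): the statistic falls in the rejection region exactly when the p-value is below $\alpha$.

set.seed(2)
alpha <- 0.05
x     <- rnorm(30, mean = 0.2)     # a random sample
tt    <- t.test(x, mu = 0)         # two-sided one-sample t-test
in_rejection_region <- abs(tt$statistic) > qt(1 - alpha / 2, df = length(x) - 1)
unname(in_rejection_region) == (tt$p.value < alpha)   # TRUE: the two criteria agree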
To come back to your post, the statement of the null hypothesis is incorrect. The null hypothesis is that the probability of flipping a head is $1/2$ (the null hypothesis cannot pertain to the results of the random experiment).
If you repeat the experiment again and again with a threshold p-value of 0.05, yes, you should have approximately 5% rejection. And if you set a p-value cut-off of 0.06, you should end up with roughly 6% rejection. More generally, for continuous tests, by definition of the p-value $p$
$$ \Pr(p < x) = x, \quad (0 < x < 1), $$
which is only approximately true for discrete tests.
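As a quick illustration of the continuous case, here is a minimal sketch (my own example, assuming a one-sample t-test on standard normal data, which is an exact test): the simulated p-values behave like uniform draws, so the proportion below any cut-off $x$ is close to $x$.

set.seed(42)
# 10,000 t-tests on samples drawn under H0 (true mean is 0)
p_vals <- replicate(10000, t.test(rnorm(50), mu = 0)$p.value)
mean(p_vals < 0.05)   # close to 0.05
mean(p_vals < 0.25)   # close to 0.25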
Here is some R code that I hope can clarify this a bit. The binomial test is relatively slow, so I do only 10,000 random experiments in which I flip 1000 coins. I perform a binomial test and collect the 10,000 p-values.
set.seed(123)
# Generate 10,000 random experiments of 1000 coin flips each
rexperiments <- rbinom(n=10000, size=1000, prob=0.5)
all_p_values <- rep(NA, 10000)
for (i in 1:10000) {
  all_p_values[i] <- binom.test(rexperiments[i], 1000)$p.value
}
# Plot the cumulative density of p-values.
plot(ecdf(all_p_values))
# How many are less than 0.05?
mean(all_p_values < 0.05)
# [1] 0.0425
# How many are less than 0.06?
mean(all_p_values < 0.06)
# [1] 0.0491
You can see that the proportions are not exact, because the sample size is not infinite and the test is discrete, but there is still an increase of roughly 1% between the two.
Because of your comments I will make two separate sections:
p-values
In statistical hypothesis testing you can find 'statistical evidence' for the alternative hypothesis; as I explained in What follows if we fail to reject the null hypothesis?, it is similar to 'proof by contradiction' in mathematics.
So if we want to find 'statistical evidence' for something, which we call $H_1$, then we assume the opposite, which we denote $H_0$. After this we draw a sample, and from the sample we compute a so-called test-statistic (e.g. a t-value in a t-test).
Then, as we assume that $H_0$ is true and that our sample is randomly drawn from the distribution under $H_0$, we can compute the probability of observing values that exceed or equal the value derived from our (random) sample. This probability is called the p-value.
If this value is ''small enough'', i.e. smaller than the significance level that we have chosen, then we reject $H_0$ and we consider $H_1$ to be 'statistically proven'.
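As a small illustration (my own example, with hypothetical data), the test-statistic and p-value of a two-sided one-sample t-test can be computed by hand and compared with R's t.test():

set.seed(1)
x      <- rnorm(25, mean = 0.3)                         # the random sample
t_stat <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))     # test-statistic under H0: mu = 0
p_val  <- 2 * pt(-abs(t_stat), df = length(x) - 1)      # prob. of a value at least as extreme
c(by_hand = p_val, t_test = t.test(x, mu = 0)$p.value)  # the two agree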
Several things are important in this procedure:
- we have derived probabilities under the assumption that $H_0$ is true
- we have taken a random sample from the distribution that was assumed under $H_0$
- we decide to have found evidence for $H_1$ if the test-statistic derived from the random sample has a low probability of being exceeded under $H_0$. So it is not impossible that it is exceeded while $H_0$ is true, and in these cases we make a type I error.
So what is a type I error? A type I error is made when the sample, randomly drawn under $H_0$, leads to the conclusion that $H_0$ is false while in reality it is true.
Note that this implies that a p-value is not the probability of a type I error. Indeed, a type I error is a wrong decision by the test, and a decision can only be made by comparing the p-value to the chosen significance level. With a p-value alone one cannot make a decision; it is only after comparing the p-value to the chosen significance level that a decision is made, and as long as no decision is made, the type I error is not even defined.
What then is the p-value? The potentially wrong rejection of $H_0$ is due to the fact that we draw a random sample under $H_0$, so it could be that we have ''bad luck'' in drawing the sample, and that this ''bad luck'' leads to a false rejection of $H_0$. So the p-value (although this is not fully correct) is more like the probability of drawing a ''bad sample''. The correct interpretation of the p-value is that it is the probability that the test-statistic exceeds or equals the value of the test-statistic derived from a randomly drawn sample under $H_0$.
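This interpretation can be mimicked by simulation: draw the test-statistic many times under $H_0$ and see how often it is at least as extreme as the observed one. A minimal sketch (my own hypothetical example: $H_0$ says the coin is fair, the statistic is the number of heads in 100 flips, and 61 heads were observed):

set.seed(2)
observed  <- 61                                       # heads observed in 100 flips
null_stat <- rbinom(100000, size = 100, prob = 0.5)   # the statistic drawn under H0
mean(abs(null_stat - 50) >= abs(observed - 50))       # Monte Carlo two-sided p-value
# close to binom.test(61, 100)$p.value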
False discovery rate (FDR)
As explained above, each time the null hypothesis is rejected, one considers this as 'statistical evidence' for $H_1$. So we have found new scientific knowledge; therefore it is called a discovery. Also explained above is that we can make false discoveries (i.e. falsely reject $H_0$) when we make a type I error. In that case we have a false belief in a scientific truth. We only want to discover things that are really true, and therefore one tries to keep the false discoveries to a minimum, i.e. one controls the type I error. It is not so hard to see that the probability of a type I error is the chosen significance level $\alpha$. So in order to control type I errors, one fixes an $\alpha$-level reflecting one's willingness to accept ''false evidence''.
Intuitively, this means that if we draw a huge number of samples, and with each sample we perform the test, then a fraction $\alpha$ of these tests will lead to a wrong conclusion. It is important to note that we're 'averaging over many samples'; so same test, many samples.
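A minimal sketch of this 'same test, many samples' idea (my own example, assuming a one-sample t-test at $\alpha = 0.05$ on samples drawn under $H_0$):

set.seed(3)
alpha   <- 0.05
rejects <- replicate(20000, t.test(rnorm(20), mu = 0)$p.value < alpha)
mean(rejects)   # fraction of wrong rejections, close to 0.05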
If we use the same sample to do many different tests then we have a multiple testing problem (see my answer on Family-wise error boundary: Does re-using data sets on different studies of independent questions lead to multiple testing problems?). In that case one can control the $\alpha$ inflation using techniques to control the family-wise error rate (FWER), like e.g. a Bonferroni correction.
A different approach from FWER control is to control the false discovery rate (FDR). In that case one controls the fraction of false discoveries (FD) among all discoveries (D), i.e. one controls $\frac{FD}{D}$, where $D$ is the number of rejected $H_0$.
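For illustration, both corrections are available in R through p.adjust() (the p-values below are hypothetical; the adjusted values are then compared to the chosen level, e.g. 0.05):

p <- c(0.0004, 0.003, 0.012, 0.04, 0.08, 0.2, 0.5, 0.9)   # p-values from many tests on one sample
p.adjust(p, method = "bonferroni")   # controls the family-wise error rate (FWER)
p.adjust(p, method = "BH")           # controls the false discovery rate (FDR)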
So the type I error probability has to do with executing the same test on many different samples. For a huge number of samples, the fraction of samples leading to a false rejection among all samples drawn will converge to the type I error probability.
The FDR has to do with many tests on the same sample, and for a huge number of tests the fraction given by the number of tests where a type I error is made (i.e. the number of false discoveries) divided by the total number of rejections of $H_0$ (i.e. the total number of discoveries) will converge to the FDR.
Note that, comparing the two paragraphs above:
- The context is different; one test and many samples versus many tests and one sample.
- The denominator for computing the type I error probability is clearly different from the denominator for computing the FDR. The numerators are similar in a way, but have a different context.
The FDR tells you that, if you perform many tests on the same sample and you find 1000 discoveries (i.e. rejections of $H_0$), then with an FDR of 0.38 you should expect about $0.38 \times 1000 = 380$ of them to be false discoveries.
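A minimal simulation sketch of the FDR idea (my own example: 1000 tests, $H_0$ true for 900 of them, Benjamini-Hochberg at level 0.1):

set.seed(4)
is_null <- c(rep(TRUE, 900), rep(FALSE, 100))   # 900 true nulls, 100 real effects
p_vals  <- sapply(is_null, function(null) {
  x <- rnorm(30, mean = if (null) 0 else 0.8)   # shifted mean when H0 is false
  t.test(x, mu = 0)$p.value
})
rejected <- p.adjust(p_vals, method = "BH") < 0.1
sum(rejected)                            # number of discoveries D
sum(rejected & is_null)                  # number of false discoveries FD
sum(rejected & is_null) / sum(rejected)  # FD / D, on average controlled at 0.1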
Best Answer
There are two p-values of interest. The critical p-value, also known as the $\alpha$-level or significance level, is decided and fixed before the study / analysis is performed. This critical p-value is in fact the probability of Type I error. Your examples are talking about the critical p-value, and I agree with those who are saying they are incorrect or at least misleading, because in common usage "the p-value" refers to the next type of p-value.
The observed p-value, or calculated p-value, is calculated as part of a statistical test. The observed p-value is the probability under the null hypothesis of observing data that leads to a test statistic as or more extreme than the test statistic actually observed in the experiment. The definition of "extreme" varies depending on your test, e.g. one-sided versus two-sided. The common usage of the phrase "the p-value" usually refers to the observed p-value.
We reject the null hypothesis if the observed p-value is less than or equal to the critical p-value. This is because, under the null hypothesis, the observed p-value is uniformly distributed on $[0,1]$. Thus the probability that we reject the null when in fact the null is true is equal to the critical p-value.
Edit: The full section in [1] is talking about the observed p-values, but the one line you quoted only makes sense when talking about the critical p-value. If you were to force it to be about observed p-values and be correct, it would be something like "It is the probability of a new, identical study wrongly rejecting the null hypothesis if you set your alpha-level equal to it." Strictly speaking, once you've done your analysis, the probability of wrongly rejecting the null hypothesis is either 0 or 1, depending on whether or not the null hypothesis is true and whether or not you rejected it!