Because of your comments I will make two separate sections:
p-values
In statistical hypothesis testing you can find 'statistical evidence' for the alternative hypothesis; as I explained in What follows if we fail to reject the null hypothesis?, it is similar to 'proof by contradiction' in mathematics.
So if we want to find 'statistical evidence', then we assume the opposite of what we try to prove: the claim we want to prove is called $H_1$ and its opposite is denoted $H_0$. After this we draw a sample, and from the sample we compute a so-called test-statistic (e.g. a t-value in a t-test).
Then, as we assume that $H_0$ is true and that our sample is randomly drawn from the distribution under $H_0$, we can compute the probability of observing values that exceed or equal the value derived from our (random) sample. This probability is called the p-value.
If this value is 'small enough', i.e. smaller than the significance level that we have chosen, then we reject $H_0$ and we consider $H_1$ to be 'statistically proven'.
Several things are important in this procedure:
- we have derived probabilities under the assumption that $H_0$ is true
- we have taken a random sample from the distribution that was assumed under $H_0$
- we decide that we have found evidence for $H_1$ if the test-statistic derived from the random sample has a low probability of being exceeded under $H_0$. It is, however, not impossible to observe such an extreme value while $H_0$ is true, and in those cases we make a type I error.
So what is a type I error? A type I error is made when the sample, randomly drawn under $H_0$, leads to the conclusion that $H_0$ is false while in reality it is true.
Note that this implies that a p-value is not the probability of a type I error. Indeed, a type I error is a wrong decision by the test, and a decision can only be made by comparing the p-value to the chosen significance level. With a p-value alone one cannot make a decision; it is only after comparing the p-value to the chosen significance level that a decision is made, and as long as no decision is made, a type I error is not even defined.
What then is the p-value? The potentially wrong rejection of $H_0$ is due to the fact that we draw a random sample under $H_0$, so it could be that we have 'bad luck' in drawing the sample, and that this 'bad luck' leads to a false rejection of $H_0$. So the p-value (although this is not fully correct) is more like the probability of drawing a 'bad sample'. The correct interpretation of the p-value is that it is the probability that the test-statistic exceeds or equals the value of the test-statistic derived from a randomly drawn sample under $H_0$.
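To make this concrete, here is a minimal R sketch with a made-up sample (the sample size, mean and standard deviation are arbitrary choices for illustration) that computes a one-sample t-statistic by hand, the corresponding two-sided p-value under $H_0: \mu = 50$, and checks it against t.test:

set.seed(42)
y = rnorm(20, mean = 51, sd = 5)                     # made-up sample of size 20
t_obs = (mean(y) - 50) / (sd(y) / sqrt(length(y)))   # test-statistic under H0: mu = 50
2 * pt(-abs(t_obs), df = length(y) - 1)              # P(|T| >= |t_obs|) under H0, i.e. the p-value
t.test(y, mu = 50)$p.value                           # the same value, obtained via t.test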
False discovery rate (FDR)
As explained above, each time the null hypothesis is rejected, one considers this as 'statistical evidence' for $H_1$. So we have found new scientific knowledge, and therefore it is called a discovery. Also explained above is that we can make false discoveries (i.e. falsely reject $H_0$) when we make a type I error. In that case we have a false belief in a scientific truth. We only want to discover things that are really true, and therefore one tries to keep the number of false discoveries to a minimum, i.e. one controls for type I errors. It is not hard to see that the probability of a type I error is the chosen significance level $\alpha$. So in order to control for type I errors, one fixes an $\alpha$-level reflecting one's willingness to accept 'false evidence'.
Intuitively, this means that if we draw a huge number of samples, and with each sample we perform the test, then a fraction $\alpha$ of these tests will lead to a wrong conclusion. It is important to note that we're 'averaging over many samples'; so same test, many samples.
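As a sketch of this 'same test, many samples' idea (the sample size of 30 and the parameters of the normal distribution are arbitrary choices here), one can simulate many samples for which $H_0$ is true and check that the rejection rate is close to $\alpha = 0.05$:

set.seed(1)
pv0 = replicate(10^4, t.test(rnorm(30, mean = 50, sd = 5), mu = 50)$p.value)  # H0: mu = 50 is true in every sample
mean(pv0 <= 0.05)   # fraction of false rejections, close to alpha = 0.05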
If we use the same sample to do many different tests then we have a multiple testing problem (see my answer on Family-wise error boundary: Does re-using data sets on different studies of independent questions lead to multiple testing problems?). In that case one can control the $\alpha$ inflation using techniques that control the family-wise error rate (FWER), e.g. a Bonferroni correction.
A different approach from controlling the FWER is to control the false discovery rate (FDR). In that case one controls the proportion of false discoveries (FD) among all discoveries (D), i.e. one controls $\frac{FD}{D}$, where $D$ is the number of rejected $H_0$'s.
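For illustration, R's p.adjust implements both approaches; the p-values below are made up, as if they came from five tests on the same sample:

p = c(0.001, 0.012, 0.034, 0.044, 0.047)   # hypothetical p-values from 5 tests on one sample
p.adjust(p, method = "bonferroni")         # adjusted p-values controlling the FWER (Bonferroni)
p.adjust(p, method = "BH")                 # adjusted p-values controlling the FDR (Benjamini-Hochberg)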
So the type I error probability has to do with executing the same test on many different samples. For a huge number of samples the type I error probability will converge to the number of samples leading to a false rejection divided by the total number of samples drawn.
The FDR has to do with many tests on the same sample and, for a huge number of tests, it will converge to the number of tests where a type I error is made (i.e. the number of false discoveries) divided by the total number of rejections of $H_0$ (i.e. the total number of discoveries).
Note that, comparing the two paragraphs above:
- The context is different; one test and many samples versus many tests and one sample.
- The denominator for computing the type I error probability is clearly different from the denominator for computing the FDR. The numerators are similar in a way, but have a different context.
The FDR tells you that, if you perform many tests on the same sample and you find 1000 discoveries (i.e. rejections of $H_0$), then with an FDR of 0.38 you can expect about $0.38 \times 1000 = 380$ of them to be false discoveries.
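The sketch below illustrates this $\frac{FD}{D}$ ratio by simulation with independently generated tests; the proportion of true nulls (80%), the shift of 3 under $H_1$ and the 5% threshold are all arbitrary assumptions made only for the example:

set.seed(2)
m = 10^4                                       # number of tests
true_null = runif(m) < 0.8                     # assume 80% of the null hypotheses are true
z = rnorm(m, mean = ifelse(true_null, 0, 3))   # test-statistics: shifted by 3 when H1 holds
pvals = 1 - pnorm(z)                           # one-sided p-values
D  = sum(pvals <= 0.05)                        # discoveries (rejections of H0)
FD = sum(pvals <= 0.05 & true_null)            # false discoveries (type I errors)
FD / D                                         # empirical FD/D, the quantity the FDR describes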
Suppose a previous process for making a particular kind of steel wire yielded wire with breaking strength $\mathsf{Norm}(\mu=50,\sigma=5).$ A new process is now in use and we would like to know if the breaking strength has changed. If it has changed, we have no basis for guessing whether it is higher or lower.

Now $n = 42$ test specimens of the new wire are available and their breaking strengths, recorded in the vector x, have been determined. A change of $2$ or more would be of practical importance.

We wish to use a two-sided, one-sample t test, at the 5% level, of $H_0: \mu=50$ against the alternative $H_a: \mu \ne 50.$ In R, the relevant test gives the following output. The result of this two-sided test is not significant at the 5% level.
t.test(x, mu=50)
One Sample t-test
data: x
t = 1.9969, df = 41, p-value = 0.0525
alternative hypothesis:
true mean is not equal to 50
95 percent confidence interval:
49.97994 53.56558
sample estimates:
mean of x
51.77276
Before the specimens from the new process were measured for breaking strength, we used the standard deviation $\sigma=5$ and the important difference $\Delta = 2$ to see how many specimens should be used for the test. We determined that $n=45$ specimens would suffice to give power (the probability of detecting a real difference of size $\Delta=2$) of about 75%. So the test was not 'sure' to give a significant result even if there is a real difference. To make matters a little worse, we got only $n=42$ specimens.
set.seed(1005)
# Simulated power: fraction of 10^5 samples of size 45 from Norm(52, 5)
# for which the two-sided test of H0: mu = 50 rejects at the 5% level
pv = replicate(10^5, t.test(rnorm(45, 52, 5), mu=50)$p.val)
mean(pv <= 0.05)
[1] 0.74662
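An analytic cross-check of this simulated power (under the same assumptions: $\sigma = 5$, $\Delta = 2$, $n = 45$, two-sided test at the 5% level) can be obtained with power.t.test, which also gives a power of roughly 75%:

power.t.test(n = 45, delta = 2, sd = 5, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")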
Now suppose someone notices that the sample mean $\bar X = 51.77$ is
larger than $\mu_0 = 50$ and suggests that we could get a P-value
smaller than the magical 5% level by doing a one-sided test, as shown below.
The P-value of the right-sided test is half the P-value of the two-sided test.
t.test(x, mu=50, alt="greater")
One Sample t-test
data: x
t = 1.9969, df = 41, p-value = 0.02625
alternative hypothesis:
true mean is greater than 50
95 percent confidence interval:
50.27881 Inf
sample estimates:
mean of x
51.77276
There are several things wrong with using this one-sided test to declare
that the new process differs significantly from the old one. Here are a few.
- We set out to test for a change in either direction. Now a second analysis of the same data has 'declared' an increase with significance barely below the 5% level. This is "P-hacking," which can lead to "false discovery."
- The 95% confidence interval for $\mu$ from the two-sided test is $(49.98,\, 53.57),$ which includes the hypothetical value 50 (if only just barely).
- The actual difference between $\mu=50$ and $\bar X = 51.77$ is less than the 2 units we said would be of practical importance.
- We had planned a somewhat skimpy sample size of 45 in our power computation and finally had only 42 available. Maybe the new process is different from the old, and maybe not. We don't have enough data to say it is.
Note: The fictitious data used above was sampled in R as shown below. Of course, in a real-life application the exact population parameters would never be known.
set.seed(2021)
x = rnorm(42, 52, 5)   # 42 breaking strengths drawn from Norm(mu = 52, sigma = 5)
Best Answer
(Technically, the P-value is the probability of observing data at least as extreme as that actually observed, given the null hypothesis.)
Q1. A decision to reject the null hypothesis on the basis of a small P-value typically depends on 'Fisher's disjunction': either a rare event has happened or the null hypothesis is false. In effect, it is the rarity of the event that the P-value tells you about, rather than the probability that the null is false.
The probability that the null is false can be obtained from the experimental data only by way of Bayes' theorem, which requires specification of the 'prior' probability of the null hypothesis (presumably what Gill is referring to as "marginal distributions").
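As a rough numerical sketch of that point (the prior, the significance level and the power below are all hypothetical numbers chosen only for illustration), Bayes' theorem gives the posterior probability that the null is true given a rejection:

prior_H0 = 0.5                                       # hypothetical prior P(H0)
alpha    = 0.05                                      # P(reject | H0), the significance level
pow      = 0.80                                      # P(reject | H1), assumed power
p_reject = alpha * prior_H0 + pow * (1 - prior_H0)   # total probability of rejecting
alpha * prior_H0 / p_reject                          # P(H0 | reject), about 0.06 here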
Q2. This part of your question is much harder than it might seem. There is a great deal of confusion regarding P-values and error rates which is, presumably, what Gill is referring to with "but is typically treated as such." The combination of Fisherian P-values with Neyman-Pearsonian error rates has been called an incoherent mishmash, and it is unfortunately very widespread. No short answer is going to be completely adequate here, but I can point you to a couple of good papers (yes, one is mine). Both will help you make sense of the Gill paper.
Hurlbert, S., & Lombardi, C. (2009). Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici, 46(5), 311–349.
Lew, M. J. (2012). Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don't know P. British Journal of Pharmacology, 166(5), 1559–1567. doi:10.1111/j.1476-5381.2012.01931.x