Hypothesis Testing – Misunderstanding a P-Value

hypothesis testing, p-value

So I've been reading a lot about how to correctly interpret a P-value, and from what I've read, the p-value says NOTHING about the probability that the null hypothesis is true or false. However, when reading the following statement:

The p-value represents the probability of making a type I error, or rejecting the null hypothesis when it is true. The smaller the p-value, the smaller is the probability that you would be wrongly rejecting the null hypothesis.

EDIT: And then 5 minutes later I read:

Incorrect interpretations of P values are very common. The most common mistake is to interpret a P value as the probability of making a mistake by rejecting a true null hypothesis (a Type I error).

This confused me. Which one is correct? And can anyone please explain how to correctly interpret the p-value and how it properly relates back to probability of making a type I error?

Best Answer

Because of your comments I will make two separate sections:

p-values

In statistical hypothesis testing you can find 'statistical evidence' for the alternative hypothesis. As I explained in What follows if we fail to reject the null hypothesis?, it is similar to 'proof by contradiction' in mathematics.

So if we want to find 'statistical evidence' for a claim, which we call $H_1$, then we assume its opposite, which we denote $H_0$. After this we draw a sample, and from the sample we compute a so-called test-statistic (e.g. a t-value in a t-test).

Then, as we assume that $H_0$ is true and that our sample is randomly drawn from the distribution under $H_0$, we can compute the probability of observing values that exceed or equal the value derived from our (random) sample. This probability is called the p-value.

If this value is ''small enough'', i.e. smaller than the significance level that we have chosen, then we reject $H_0$ and we consider $H_1$ to be 'statistically proven'.
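
To make the procedure concrete, here is a minimal sketch in Python (my own illustration, not part of the original question): the data are simulated, and the normal model with a one-sample t-test is just a convenient choice.

```python
# Minimal sketch of the procedure: assume H0, draw a sample, compute the
# test-statistic and the p-value, compare with the chosen significance level.
# The data here are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

mu_0 = 0.0          # H0: the population mean equals 0; H1: it does not
alpha = 0.05        # significance level chosen before looking at the data

sample = rng.normal(loc=0.3, scale=1.0, size=30)   # the (random) sample

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_0)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

if p_value < alpha:
    print("Reject H0: we take this as statistical evidence for H1.")
else:
    print("Fail to reject H0: no statistical evidence for H1.")
```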

Several things are important in this way of working:

  • we have derived probabilities under the assumption that $H_0$ is true
  • we have taken a random sample from the distribution that was assumed under $H_0$
  • we decide to have found evidence for $H_1$ if the test-statistic derived from the random sample has a low probability of being exceeded. So it is not impossible that it is exceeded while $H_0$ is true, and in these cases we make a type I error.

So what is a type I error: a type I error is made when the sample, randomly drawn from $H_0$, leads to the conclusion that $H_0$ is false while in reality it is true.

Note that this implies that a p-value is not the probability of a type I error. Indeed, a type I error is a wrong decision made by the test, and a decision can only be made by comparing the p-value to the chosen significance level. With a p-value alone one cannot make a decision; it is only after comparing the p-value to the chosen significance level that a decision is made, and as long as no decision is made, the type I error is not even defined.

What then is the p-value? The potentially wrong rejection of $H_0$ is due to the fact that we draw a random sample under $H_0$, so it could be that we have ''bad luck'' in drawing the sample, and that this ''bad luck'' leads to a false rejection of $H_0$. So the p-value (although this is not fully correct) is more like the probability of drawing a ''bad sample''. The correct interpretation of the p-value is that it is the probability, under $H_0$, that the test-statistic exceeds or equals the value of the test-statistic derived from the randomly drawn sample.
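
If it helps, this interpretation can be checked with a small simulation (again my own sketch, assuming a one-sample t-test under a normal model): drawing many fresh samples under $H_0$, the fraction of test-statistics at least as extreme as the observed one should be close to the analytic p-value.

```python
# Sketch: under H0 the p-value matches the probability that the test-statistic
# is at least as extreme as the one observed in our sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, mu_0 = 30, 0.0

# One observed sample and its (two-sided) p-value.
observed = rng.normal(loc=mu_0, scale=1.0, size=n)
t_obs, p_obs = stats.ttest_1samp(observed, popmean=mu_0)

# Many new samples drawn under H0: how often is |t| >= |t_obs|?
n_sims = 20_000
t_null = np.array([
    stats.ttest_1samp(rng.normal(loc=mu_0, scale=1.0, size=n), popmean=mu_0).statistic
    for _ in range(n_sims)
])

print(f"analytic p-value:                  {p_obs:.4f}")
print(f"fraction at least as extreme (H0): {np.mean(np.abs(t_null) >= abs(t_obs)):.4f}")
```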


False discovery rate (FDR)

As explained above, each time the null hypothesis is rejected, one considers this as 'statistical evidence' for $H_1$. So we have found new scientific knowledge, therefore it is called a discovery. Also explained above is that we can make false discoveries (i.e. falsely rejecting $H_0$) when we make a type I error. In that case we have a false belief of a scientific truth. We only want to discover really true things and therefore one tries to keep the false discoveries to a minimum, i.e. one will control for a type I error. It is not so hard to see that the probability of a type I error is the chosen significance level $\alpha$. So in order to control for type I errors, one fixes an $\alpha$-level reflecting your willingness to accept ''false evidence''.

Intuitively, this means that if we draw a huge number of samples, and with each sample we perform the test, then a fraction $\alpha$ of these tests will lead to a wrong conclusion. It is important to note that we're 'averaging over many samples'; so same test, many samples.
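
A rough simulation of this 'same test, many samples' idea (my own sketch, with simulated data for which $H_0$ really is true):

```python
# Sketch: draw many samples under a true H0, run the same test each time,
# and check that the fraction of (wrong) rejections is close to alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, mu_0 = 0.05, 30, 0.0
n_samples = 10_000

rejections = 0
for _ in range(n_samples):
    sample = rng.normal(loc=mu_0, scale=1.0, size=n)   # H0 is true here
    _, p = stats.ttest_1samp(sample, popmean=mu_0)
    rejections += p < alpha

print(f"fraction of samples where H0 was wrongly rejected: {rejections / n_samples:.4f}")
# This fraction should be close to the chosen alpha = 0.05.
```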

If we use the same sample to do many different tests then we have a multiple testing problem (see my answer on Family-wise error boundary: Does re-using data sets on different studies of independent questions lead to multiple testing problems?). In that case one can control the $\alpha$ inflation using techniques to control the family-wise error rate (FWER), such as a Bonferroni correction.
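
As a small illustration of a Bonferroni correction (my own sketch with made-up p-values): with $m$ tests, each individual test is carried out at level $\alpha/m$ so that the FWER stays at most $\alpha$.

```python
# Sketch of a Bonferroni correction: m tests on the same data, each performed
# at level alpha / m. The p-values below are made up for illustration.
import numpy as np

alpha = 0.05
p_values = np.array([0.001, 0.012, 0.030, 0.047, 0.200])
m = len(p_values)

reject_uncorrected = p_values < alpha        # 4 rejections at level 0.05
reject_bonferroni = p_values < alpha / m     # only 1 rejection at level 0.01

print("uncorrected rejections:", int(reject_uncorrected.sum()))
print("Bonferroni rejections: ", int(reject_bonferroni.sum()))
```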

A different approach than FWER is to control the false discovery rate (FDR). In that case one controls the number of false discoveries (FD) among all discoveries (D), so one controls the fraction $\frac{FD}{D}$, where $D$ is the number of rejected $H_0$'s.
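
A common way to control $\frac{FD}{D}$ is the Benjamini–Hochberg procedure; here is a small sketch (my own, with made-up p-values) using statsmodels:

```python
# Sketch of FDR control with the Benjamini-Hochberg procedure.
# The p-values are made up; q is the FDR level we are willing to accept.
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([0.001, 0.008, 0.012, 0.030, 0.047, 0.200, 0.450, 0.700])
q = 0.10

reject, p_adjusted, _, _ = multipletests(p_values, alpha=q, method='fdr_bh')

print("discoveries (rejected H0):", int(reject.sum()))
print("BH-adjusted p-values:     ", np.round(p_adjusted, 3))
# Among the discoveries, we expect at most a fraction q = 0.10 to be false.
```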

So the type I error probability has to do with executing the same test on many different samples. For a huge number of samples, the fraction of samples leading to a false rejection (out of all samples drawn) will converge to the type I error probability.

The FDR has to do with many tests on the same sample; for a huge number of tests, the number of tests where a type I error is made (i.e. the number of false discoveries) divided by the total number of rejections of $H_0$ (i.e. the total number of discoveries) will converge to the FDR.

Note that, comparing the two paragraphs above:

  1. The context is different; one test and many samples versus many tests and one sample.
  2. The denominator for computing the type I error probability is clearly different from the denominator for computing the FDR. The numerators are similar in a way, but have a different context.

The FDR tells you that, if you perform many tests on the same sample and you find 1000 discoveries (i.e. rejections of $H_0$), then with an FDR of 0.38 you should expect about $0.38 \times 1000 = 380$ of them to be false discoveries.
