Hypothesis Testing – Comparing and Contrasting P-Values, Significance Levels, and Type I Error

Tags: error, hypothesis testing, p-value, probability, statistical significance

I was wondering if anybody could give a concise rundown of the definitions and uses of p-values, significance levels, and Type I error.

I understand that p-values are defined as "the probability of obtaining a test statistic at least as extreme as the one we actually observed", while a significance level is just an arbitrary cutoff value used to gauge whether the p-value is significant or not. A Type I error is the error of rejecting a null hypothesis that is true. However, I am unsure about the difference between the significance level and the Type I error rate: are they not the same concept?

For example, assume a very simple experiment where I flip a coin 1000 times and count the number of times it lands on 'heads'. My null hypothesis, H0, is that heads = 500 (unbiased coin). I then set my significance level at alpha = 0.05.

I flip the coin 1000 times and then calculate the p-value: if the p-value is > 0.05, I fail to reject the null hypothesis, and if the p-value is < 0.05, I reject the null hypothesis.
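
For concreteness, here is a minimal R sketch of what I mean by a single run, assuming the exact two-sided binomial test (binom.test) and an arbitrary seed just for reproducibility:

# Minimal sketch of one experiment, assuming an exact two-sided binomial test
set.seed(1)                                   # arbitrary seed, for reproducibility
heads <- rbinom(1, size = 1000, prob = 0.5)   # number of heads in 1000 fair flips
test  <- binom.test(heads, n = 1000, p = 0.5)
test$p.value                                  # the p-value
test$p.value < 0.05                           # TRUE means "reject H0 at alpha = 0.05"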

Now if I did this experiment repeatedly, each time calculating the p-value and either rejecting or failing to reject the null hypothesis, and kept a count of how many I rejected or failed to reject, then I would end up rejecting about 5% of the null hypotheses that were in actuality true. Is that correct? This is the definition of Type I error. Therefore, the significance level in Fisher significance testing is essentially the Type I error rate from Neyman–Pearson hypothesis testing if you performed repeated experiments.

Now as for p-values: if I had gotten a p-value of 0.06 from my last experiment, and I did multiple experiments and counted all the ones where I got a p-value between 0 and 0.06, would I then not also have a 6% chance of rejecting a true null hypothesis?

Best Answer

The question looks simple, but your reflections on it show that it is not that simple.

Actually, p-values are a relatively late addition to the theory of statistics. Computing a p-value without a computer is very tedious, which is why, until recently, the only way to perform a statistical test was to use tables of statistical tests, as I explain in this blog post. Because those tables were computed for fixed $\alpha$ levels (typically 0.05, 0.01, and 0.001), you could only perform a test at those levels.

Computers made those tables useless, but the logic of testing is still the same. You should:

  1. Formulate a null hypothesis.
  2. Formulate an alternative hypothesis.
  3. Decide the maximum Type I error (the probability of falsely rejecting the null hypothesis) you are ready to accept.
  4. Design a rejection region. The probability that the test statistic falls in the rejection region given that the null hypothesis is true is your level $\alpha$. As @MånsT explains, this should be no larger than your acceptable Type I error, and in many cases you will use asymptotic approximations (see the sketch after this list).
  5. Carry out the random experiment, compute the test statistic and see whether it falls in the rejection region.
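
To make these steps concrete for the coin example, here is a rough R sketch, assuming a symmetric two-sided rejection region built from binomial quantiles (one reasonable choice among several):

# Sketch of steps 1-5 for the coin example, with a symmetric rejection region
n     <- 1000        # number of flips
alpha <- 0.05        # step 3: maximum acceptable Type I error
# Step 4: rejection region {X <= lower or X >= upper} under H0: p = 1/2
lower <- qbinom(alpha / 2, n, 0.5) - 1
upper <- qbinom(1 - alpha / 2, n, 0.5) + 1
# Actual level of this region (at most alpha, because the test is discrete)
pbinom(lower, n, 0.5) + (1 - pbinom(upper - 1, n, 0.5))
# Step 5: run the experiment and check whether the statistic falls in the region
set.seed(1)          # arbitrary seed
heads <- rbinom(1, size = n, prob = 0.5)
heads <= lower | heads >= upper    # TRUE means "reject H0"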

In theory, there is a strict equivalence between the events "the statistic falls in the rejection region" and "the p-value is less than $\alpha$", which is why it is felt that you can report the p-value instead. In practice, it allows you to skip step 3 and evaluate the type I error after the test is done.
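
To see this equivalence concretely in a continuous case, here is a small toy sketch of my own (a one-sample z-test of $H_0: \mu = 0$ with known standard deviation 1), in which the two decisions always agree:

# Equivalence check for a continuous test: one-sample z-test, H0: mu = 0, known sd = 1
alpha  <- 0.05
z_crit <- qnorm(1 - alpha / 2)      # rejection region: |z| > z_crit
set.seed(2)                         # arbitrary seed
x <- rnorm(30)                      # a sample of size 30 drawn under H0
z <- mean(x) / (1 / sqrt(30))       # z statistic, using the known sd = 1
p <- 2 * pnorm(-abs(z))             # two-sided p-value
abs(z) > z_crit                     # decision based on the rejection region
p < alpha                           # decision based on the p-value: always the same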

To come back to your post, the statement of the null hypothesis is incorrect. The null hypothesis is that the probability of flipping a head is $1/2$ (the null hypothesis cannot pertain to the results of the random experiment).

If you repeat the experiment again and again with a threshold p-value of 0.05, yes, you should have approximately 5% rejection. And if you set a p-value cut-off of 0.06, you should end up with roughly 6% rejection. More generally, for continuous tests, by definition of the p-value $p$

$$ \Pr(p < x) = x, \qquad 0 < x < 1, $$

which is only approximately true for discrete tests.
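
As a quick numerical check of the continuous case, one can simulate p-values from a test on continuous data with a true null hypothesis; in the toy setup below (one-sample t-tests on standard normal samples, my own example), the ECDF of the p-values should hug the diagonal. Compare this with the discrete binomial case that follows.

# P-values of a continuous test under a true null are uniform on (0, 1)
set.seed(42)                        # arbitrary seed
p_cont <- replicate(10000, t.test(rnorm(20))$p.value)
mean(p_cont < 0.05)                 # should be close to 0.05
mean(p_cont < 0.06)                 # should be close to 0.06
plot(ecdf(p_cont))                  # the ECDF should track the diagonal closely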

Here is some R code that I hope can clarify this a bit. The binomial test is relatively slow, so I run only 10,000 random experiments, each consisting of 1000 coin flips. I perform a binomial test on each and collect the 10,000 p-values.

set.seed(123)
# Simulate 10,000 experiments of 1000 coin flips each
rexperiments <- rbinom(n=10000, size=1000, prob=0.5)
# Perform a binomial test on each experiment and collect the p-values
all_p_values <- rep(NA, 10000)
for (i in 1:10000) {
    all_p_values[i] <- binom.test(rexperiments[i], 1000)$p.value
}
# Plot the empirical cumulative distribution of the p-values
plot(ecdf(all_p_values))
# What proportion are less than 0.05?
mean(all_p_values < 0.05)
# [1] 0.0425
# What proportion are less than 0.06?
mean(all_p_values < 0.06)
# [1] 0.0491

You can see that the proportions are not exact, because the number of simulated experiments is not infinite and the test is discrete, but there is still an increase of roughly one percentage point between the two.
