I was wondering if anybody could give a concise rundown of the definitions and uses of p-values, significance level and type I error.
I understand that a p-value is defined as "the probability of obtaining a test statistic at least as extreme as the one we actually observed", assuming the null hypothesis is true, while a significance level is just an arbitrary cutoff used to gauge whether the p-value is significant or not. A type I error is the error of rejecting a null hypothesis that was true. However, I am unsure about the difference between the significance level and the type I error: are they not the same concept?
For example, assume a very simple experiment where I flip a coin 1000 times and count the number of times it lands on 'heads'. My null hypothesis, H0, is that heads = 500 (unbiased coin). I then set my significance level at alpha = 0.05.
I flip the coin 1000 times and then calculate the p-value. If the p-value is > 0.05, I fail to reject the null hypothesis, and if the p-value is < 0.05, I reject the null hypothesis.
Now if I did this experiment repeatedly, each time calculating the p-value and either rejecting or failing to reject the null hypothesis, and kept a count of how many times I rejected or failed to reject, then I would end up rejecting 5% of the null hypotheses that were in actuality true. Is that correct? This is the definition of the type I error rate. Therefore, the significance level in Fisher significance testing is essentially the type I error rate from Neyman-Pearson hypothesis testing if you performed repeated experiments.
Now as for p-values: if I had gotten a p-value of 0.06 from my last experiment, and across multiple experiments I counted all the ones in which I got a p-value between 0 and 0.06, then would I not also have a 6% chance of rejecting a true null hypothesis?
Best Answer
The question looks simple, but your reflection on it shows that it is not that simple.
Actually, p-values are a relatively late addition to the theory of statistics. Computing a p-value without a computer is very tedious, which is why, until recently, the only way to perform a statistical test was to use statistical tables, as I explain in this blog post. Because those tables were computed for fixed $\alpha$ levels (typically 0.05, 0.01 and 0.001), you could only perform a test at those levels.
Computers made those tables useless, but the logic of testing is still the same. You should:

1. formulate a null hypothesis (and an alternative hypothesis);
2. choose a test statistic whose distribution under the null hypothesis is known;
3. choose the maximum type I error $\alpha$ you are willing to accept, and derive the corresponding rejection region;
4. compute the statistic, and reject the null hypothesis if it falls in the rejection region.
In theory, there is a strict equivalence between the events "the statistic falls in the rejection region" and "the p-value is less than $\alpha$", which is why reporting the p-value instead is felt to be acceptable. In practice, it allows you to skip step 3 and to evaluate the type I error after the test is done.
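For the coin example, this equivalence can be checked exhaustively. A minimal R sketch (using a one-sided variant of the test, $H_1: p > 1/2$, which I pick here only because its rejection region is a single upper tail):

```r
n <- 1000      # number of coin flips
alpha <- 0.05  # maximum acceptable type I error

# p-value after observing h heads: P(X >= h) under H0: p = 1/2
p.val <- function(h) pbinom(h - 1, n, 0.5, lower.tail = FALSE)

# Rejection region {h : h >= crit}: the smallest upper tail whose
# probability under H0 does not exceed alpha
crit <- qbinom(1 - alpha, n, 0.5) + 1

# "statistic falls in the rejection region" and "p-value <= alpha"
# are the same event, for every possible outcome
all((0:n >= crit) == (p.val(0:n) <= alpha))  # TRUE
```

(For a discrete test the exact equivalence is with "p-value $\le \alpha$" rather than a strict inequality.)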
To come back to your post, the statement of the null hypothesis is incorrect. The null hypothesis is that the probability of flipping a head is $1/2$ (the null hypothesis cannot pertain to the results of the random experiment).
If you repeat the experiment again and again with a p-value threshold of 0.05, then yes, you should see approximately 5% rejection. And if you set a p-value cut-off of 0.06, you should end up with roughly 6% rejection. More generally, for continuous tests, by definition of the p-value $p$,
$$ Prob(p < x) = x, \, (0 < x < 1), $$
which is only approximately true for discrete tests.
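You can check the continuous case by simulation. A quick sketch (my choice of test: a one-sample t-test on standard normal data, so that the null hypothesis is true and the p-value distribution is continuous):

```r
set.seed(17)
# Under H0 the p-value of a continuous test is uniform on (0, 1),
# so Prob(p < x) should be close to x for any cut-off x
p.cont <- replicate(10000, t.test(rnorm(20))$p.value)
mean(p.cont < 0.05)  # close to 0.05
mean(p.cont < 0.50)  # close to 0.50
```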
Here is some R code that I hope can clarify this a bit. The binomial test is relatively slow, so I do only 10,000 random experiments in which I flip 1000 coins. I perform a binomial test and collect the 10,000 p-values.
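Something along these lines (a sketch: fair coins simulated with rbinom, each run tested with binom.test):

```r
set.seed(123)
n.sim <- 10000   # number of simulated experiments
n.flip <- 1000   # coin flips per experiment

# Each experiment: flip a fair coin n.flip times and run a
# two-sided binomial test of H0: p = 1/2 (a true null hypothesis)
p.values <- replicate(n.sim, {
  heads <- rbinom(1, n.flip, 0.5)
  binom.test(heads, n.flip, p = 0.5)$p.value
})

# Proportion of true null hypotheses rejected at each cut-off
mean(p.values < 0.05)
mean(p.values < 0.06)
```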
You can see that the proportions are not exact, because the number of simulated experiments is finite and the test is discrete, but there is still an increase of roughly one percentage point between the two cut-offs.