Why do we consider a critical region instead of individual values? Why does the alternative hypothesis determine which tail to consider for rejection?

hypothesis-testing, statistics

Setup:

Suppose a coin is tossed 8 times and I'm trying to determine whether the coin is biased in favour of landing on heads. Let $X$ be the number of heads in 8 tosses, so $X \sim B(8,p)$. Conducting a hypothesis test with a significance level of $5\%$, my null hypothesis would be $H_0\!: p = 0.5$ and my alternative hypothesis would be $H_1\!: p > 0.5$ .

The method of conducting this test that I have learned is that I would first need to consider $X \sim B(8,0.5)$. Since the alternative hypothesis is $p > 0.5$, I would need only consider the right tail of the distribution. Then I would need to find the minimum value of $a$ such that $P(X\geq a) < 0.05$, so $[a,8]$ is the critical region. Thus if we observe $X$ to fall in $[a,8]$, we conclude it is statistically probable that the coin is biased.

Question:

  • Why do we consider a critical region instead of individual values?

It makes sense to me why we would consider a region in the context of continuous distributions, but why should we do it in the case of a discrete distribution? Suppose from the setup, $P(X=8) = 0.03$ and $P(X=7) = 0.049$, then $X=7$ would not be part of the critical region even though individually its probability is still below the set threshold of $5\%$. I don't understand why we should disregard $X=7$.

  • Why does the alternative hypothesis determine which tail to consider for rejection?

The rule I have been taught is: if the alternative hypothesis contains a ">", consider the right tail, and if it contains a "<", consider the left tail. Suppose from the setup that $P(X\geq7) < 0.05$; then it's also true that $P(X\leq2) < 0.05$. So why do we consider only one of the tails even though both tails have probability below the $5\%$ significance level? As $p$ increases, the values in both of these regions will change, so I don't understand why we focus on the right tail. How does the alternative hypothesis play into this?

Thanks for taking the time to read my question. Please don't respond with too much jargon-heavy language since I'm still in high school! I appreciate any and all help.

Best Answer

The answer to your question can be best appreciated by recalling the nature of the significance level $\alpha$. This is a value that we choose that reflects our tolerance for Type I error, which in this case, is the outcome of incorrectly concluding the coin is biased in favor of heads when in fact it is not.

To see why any nontrivial hypothesis testing procedure must admit some nonzero Type I error, consider that even when a coin is perfectly fair, there is some small but nonzero probability that it could land heads on all $n$ throws: specifically, this is just $\Pr[X = n \mid p = 1/2] = 2^{-n} > 0$. Yes, a large number of trials will make this a very small chance, but it is still greater than $0$. So any hypothesis test you can design that allows some chance of rejecting the statement that the coin is fair must also allow some possibility of being wrong by random chance.

For instance, suppose you set the rejection criterion to be $X = n$, that is to say, you will claim the coin is unfair if in $n$ trials you get all heads; otherwise you say the data is inconclusive. Then the Type I error of the test is precisely $\alpha = 2^{-n}$: $$\alpha = \Pr[\text{reject } H_0 \mid H_0 \text{ true}] = \Pr[X = n \mid p = 1/2] = 2^{-n}.$$ Now for a "large" $n$, this might be an extremely strict test--if $n = 20$, the chance you would erroneously reject $H_0$ when the coin is fair is approximately $9.5 \times 10^{-7}$, or less than $1$ in a million. But our intuition suggests that this is perhaps too strict. After all, even if the coin is severely biased towards heads, say $p = 0.9$, the chance that all $20$ flips will be heads is only $$\Pr[X = 20 \mid p = 0.9] = (0.9)^{20} \approx 0.121557.$$ This quantity we just computed is called the power of the test for the case $p = 0.9$, and represents the probability of correctly rejecting the null when it is false. Having only a $12\%$ chance to do this when the coin is turning up heads $90\%$ of the time seems, well...suboptimal. But that's the price we pay for having such a tiny chance of being wrong about the coin being unfair.
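If you want to check these numbers yourself, here is a minimal Python sketch (standard library only) that reproduces them for the "all heads" rejection rule with $n = 20$; the helper `binom_pmf` is just the binomial probability formula written out, not anything specific to this problem.

```python
from math import comb  # Python 3.8+

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 20

# Type I error of the rule "reject H0 only if X = n": P(X = 20 | p = 1/2) = 2^-20
alpha_strict = binom_pmf(n, n, 0.5)

# Power of the same rule against p = 0.9: P(X = 20 | p = 0.9) = 0.9^20
power_strict = binom_pmf(n, n, 0.9)

print(alpha_strict)  # about 9.54e-07
print(power_strict)  # about 0.1216
```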

If we are willing to accept a higher chance of Type I error, then we can construct a more reasonable test. And this brings us back to the beginning of this answer: if we are willing to accept, say, a $5\%$ chance of wrongly saying the coin is biased toward heads when it is in fact fair, then $\alpha = 0.05$, and we can optimize our rejection condition so that when the coin is fair, we will only have at most a $5\%$ chance that the test will reject the null. To do this for $n = 20$, we need to find a value, say $x_{\text{crit}}$, for which if we observe at least $x_{\text{crit}}$ heads, we say we saw too many to reasonably maintain that the coin is fair. But this choice must guarantee that a fair coin would only get at least that many heads at most $5\%$ of the time: that is, we require $$\Pr[\text{reject } H_0 \mid H_0 \text{ true}] = \Pr[x_{\text{crit}} \le X \le n \mid p = 1/2] \le \alpha.$$ Note that the rejection region is not just $X = x_{\text{crit}}$. This is because if we want to conclude that the coin is unfair when we see $X = 17$ heads out of $n = 20$ trials, then certainly seeing $18$, $19$, or $20$ heads should also lead us to conclude it's unfair. In other words, rejecting the claim that the coin is fair when $X = x_{\text{crit}}$ should also make us reject when $X > x_{\text{crit}}$.

So how do we calculate this value when $n = 20$ and $\alpha = 0.05$? We can construct a table:

$$\begin{array}{c|c|c} x & \Pr[X = x \mid p = 1/2] & \Pr[x \le X \le 20 \mid p = 1/2] \\ \hline 20 & 9.5367 \times 10^{-7} & 9.5367 \times 10^{-7} \\ 19 & 0.0000190735 & 0.0000200272 \\ 18 & 0.000181198 & 0.000201225 \\ 17 & 0.00108719 & 0.00128841 \\ 16 & 0.00462055 & 0.00590897 \\ 15 & 0.0147858 & 0.0206947 \\ 14 & 0.0369644 & 0.0576591 \\ \end{array}$$

Notice that I started at the upper end of the range, and that the third column is just the sum of the values in the second column up to the same row. Now we can read off the value of $x_{\text{crit}} = 15$ as the smallest value in the first column that corresponds to a value in the third column that does not exceed $\alpha = 0.05$. So our rejection region for this test is $15 \le X \le 20$.
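The same calculation can be automated. Here is a small sketch, under the same $n = 20$, $\alpha = 0.05$ setup, that accumulates the upper tail exactly as the table does and reports the resulting $x_{\text{crit}}$ (the names `binom_pmf` and `x_crit` are just my own labels for illustration).

```python
from math import comb  # Python 3.8+

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, alpha = 20, 0.05

# Walk down from x = n, accumulating the upper-tail probability under H0 (p = 1/2),
# and remember the smallest x whose tail probability still does not exceed alpha.
tail = 0.0
x_crit = None
for x in range(n, -1, -1):
    tail += binom_pmf(x, n, 0.5)
    if tail <= alpha:
        x_crit = x
    else:
        break

print(x_crit)  # 15, so the rejection region is 15 <= X <= 20
```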

At this point, you should see how our discussion answers your question about why we don't consider individual outcomes. In your example with $n = 8$, the probability $$\Pr[X = 7 \mid p = 1/2] = 0.03125,$$ which by itself is smaller than $\alpha = 0.05$, but it is the total probability of being in the rejection region that we require to be bounded by $\alpha$, not the probability of any specific outcome in that rejection region.
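To make this concrete for your $n = 8$ coin (using the exact binomial values rather than the rounded numbers in your question), the sketch below prints each outcome's individual probability next to the running upper-tail total; whether an outcome belongs to the rejection region is decided by the tail total, not by its own probability.

```python
from math import comb  # Python 3.8+

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, alpha = 8, 0.05

tail = 0.0
for x in range(n, 4, -1):            # look at x = 8, 7, 6, 5
    tail += binom_pmf(x, n, 0.5)     # running total P(x <= X <= 8 | p = 1/2)
    print(x, round(binom_pmf(x, n, 0.5), 5), round(tail, 5), tail <= alpha)

# x  P(X = x)  P(X >= x)  in rejection region?
# 8  0.00391   0.00391    True
# 7  0.03125   0.03516    True
# 6  0.10938   0.14453    False
# 5  0.21875   0.36328    False
```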

As for your second question, this should also be evident from our discussion. The alternative hypothesis that the coin is biased toward heads means that if $X$ counts the number of heads, then $X \ge x_{\text{crit}}$ is the form of the rejection region, because the more heads we observe, the more evidence we have that the coin is biased toward heads. However, if our statistic counted the number of tails instead, then the rejection region would need to be of the form $X \le x_{\text{crit}}$, because seeing fewer tails means seeing more heads.
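If it helps, here is the same idea as a quick sketch: counting tails instead of heads (so the statistic is $Y = n - X$, which is also $B(20, 1/2)$ under $H_0$), the rejection region is built from the left tail, and it turns out to be exactly the same event as $15 \le X \le 20$.

```python
from math import comb  # Python 3.8+

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, alpha = 20, 0.05

# Counting tails: Y = n - X also follows B(20, 1/2) under H0.
# "Biased toward heads" now means "too FEW tails", so the rejection region
# comes from the LEFT tail: the largest y with P(Y <= y | p = 1/2) <= alpha.
cum = 0.0
y_crit = None
for y in range(0, n + 1):
    cum += binom_pmf(y, n, 0.5)
    if cum <= alpha:
        y_crit = y
    else:
        break

print(y_crit)  # 5, i.e. reject when Y <= 5, which is the same event as X >= 15
```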

Finally, we should point out that a funny thing happens because of the discrete nature of the binomial distribution. If our tolerance for Type I error were $3\%$ instead of $5\%$, the table above would give us the same rejection region. This reflects the inability to observe an outcome that is somewhere between $14$ and $15$ heads, so we cannot "spend" our Type I error tolerance efficiently unless it happens to be exactly one of those numbers in the third column of the table.
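You can see this "jump" directly with the same helper as before: wrapping the tail accumulation in a small function (my own `x_crit`, purely for illustration) and trying both tolerances gives the identical region.

```python
from math import comb  # Python 3.8+

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def x_crit(n, alpha, p0=0.5):
    """Smallest x with P(x <= X <= n | p = p0) <= alpha, or None if no such x."""
    tail, best = 0.0, None
    for x in range(n, -1, -1):
        tail += binom_pmf(x, n, p0)
        if tail <= alpha:
            best = x
        else:
            break
    return best

# Both tolerances land on the same region, because the attained tail probability
# jumps from about 0.0207 (at x = 15) straight to about 0.0577 (at x = 14).
print(x_crit(20, 0.05))  # 15
print(x_crit(20, 0.03))  # 15
```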

As an exercise, what is the power of the test when $p = 0.9$ for the rejection region $15 \le X \le 20$; i.e., what is $$\Pr[15 \le X \le 20 \mid p = 0.9]?$$ And for the same region, what is the power of the test when $p = 0.51$? Why are these probabilities different?
