Question about the p-value in one-sided hypothesis testing compared to two-sided hypothesis testing

Tags: hypothesis-testing, one-tailed-test, p-value

I am watching an online tutorial about one-sided versus two-sided hypothesis testing, which discusses when you should use each. One of the things the tutorial mentions is that one-sided tests have greater power (on the side favoured by the alternative hypothesis) than two-sided tests.


However, the tutorial then mentions the following:

[screenshot of the tutorial's statement about choosing the direction of the alternative hypothesis based on the observed data]

Regarding the above statement, there is something I don't fully understand. If you choose the alternative hypothesis based on the direction observed in the sample, shouldn't the reported p-value be doubled? Say I choose my alternative in the left direction (I expect the true value to lie below the null value). Then there is greater power on that side, and it becomes easier for me to reject the null hypothesis, since I can do so with a tail probability of 0.05, whereas in a two-tailed test the probability in that tail would need to be 0.025 or less to reject.

Any insights are appreciated.

Best Answer

Suppose a previous process for making a particular kind of steel wire yielded wire with breaking strength $\mathsf{Norm}(\mu=50,\sigma=5).$ A new process is now in use and we would like to know if the breaking strength has changed. If different, we have no basis for guessing whether it is higher or lower.

Now $n = 42$ test specimens of the new wire are available and their breaking strengths, recorded in the vector x, have been measured. A change of $2$ or more would be of practical importance.

We wish to use a two-sided, one-sample t test, at the 5% level, of $H_0: \mu=50$ against the alternative $H_a: \mu \ne 50.$ In R, the relevant test gives the following output. The result of this two-sided test is not significant at the 5% level.

t.test(x, mu=50)

        One Sample t-test

data:  x
t = 1.9969, df = 41, p-value = 0.0525
alternative hypothesis: 
 true mean is not equal to 50
95 percent confidence interval:
 49.97994 53.56558
sample estimates:
mean of x 
  51.77276 

Before the specimens from the new process were measured for breaking strength, we used the standard deviation $\sigma=5$ and the important difference $\Delta = 2$ to see how many specimens should be used for the test. We determined that $n=45$ specimens would suffice to give power (the probability of detecting a real difference of size $\Delta=2$) of about 75%. So the test was not 'sure' to give a significant result even if there were a real difference. To make matters a little worse, we obtained only $n=42$ specimens.

set.seed(1005)
pv = replicate(10^5, t.test(rnorm(45, 52, 5), mu=50)$p.val)
mean(pv <= 0.05)
[1] 0.74662
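
The planned power can also be checked analytically rather than by simulation. Here is a minimal sketch using power.t.test from base R; the exact settings below are an assumption about how the planning calculation was done, not taken from the original analysis.

# Analytic power for a one-sample, two-sided t test at the 5% level,
# with n = 45, a true shift of delta = 2, and sd = 5 (the planning values above)
power.t.test(n = 45, delta = 2, sd = 5, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")
# The reported power should come out near 0.75, consistent with the simulation.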

Now suppose someone notices that the sample mean $\bar X = 51.77$ is larger than $\mu_0 = 50$ and suggests that we could get a P-value smaller than the magical 5% level by doing a one-sided test, as shown below. The P-value of the right-sided test is half the P-value of the two-sided test.

t.test(x, mu=50, alt="greater")

        One Sample t-test

data:  x
t = 1.9969, df = 41, p-value = 0.02625
alternative hypothesis: 
 true mean is greater than 50
95 percent confidence interval:
 50.27881      Inf
sample estimates:
mean of x 
 51.77276 
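
The halving is just a property of the symmetric t distribution: when the test statistic falls on the side favoured by the one-sided alternative, the one-sided p-value is the upper-tail area and the two-sided p-value is twice that area. A minimal check using the t statistic and degrees of freedom reported in the output above:

t.stat = 1.9969   # t statistic from the output above
df = 41
pt(t.stat, df, lower.tail = FALSE)      # one-sided (greater) p-value, about 0.026
2 * pt(t.stat, df, lower.tail = FALSE)  # two-sided p-value, about 0.053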

There are several things wrong with using this one-sided test to declare that the new process differs significantly from the old one. Here are a few.

  • We set out to test for a change in either direction. Now a second analysis of the same data has 'declared' an increase with significance barely below the 5% level. This is "P-hacking," which can lead to "false discovery"; the simulation sketch after this list illustrates how it inflates the error rate.

  • The 95% confidence interval for $\mu$ from the two-sided test is $(49.98,\, 53.57),$ which includes the hypothetical value 50 (if only just barely).

  • The actual difference between $\mu=50$ and $\bar X = 51.77$ is less than the 2 units we said is of practical importance.

  • We had planned a somewhat skimpy sample size of 45 in our power computation and in the end had only 42 specimens available. Maybe the new process is different from the old one, and maybe not. We don't have enough data to say it is.
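
The first point connects back to the question about 'doubling.' If the direction of the one-sided test is chosen only after looking at the data, then rejecting whenever that one-sided p-value is at most 0.05 is the same as rejecting whenever the two-sided p-value is at most 0.10, so the effective type I error rate is roughly doubled. Below is a minimal simulation sketch of this, under the null with the same $n$, $\mu_0$, and $\sigma$ as above; the seed and the number of replications are arbitrary choices for illustration.

set.seed(411)
# For each sample generated under H0, run the one-sided test in whichever
# direction the sample mean happens to point, and record whether it 'rejects'.
rej = replicate(10^5, {
        x = rnorm(42, 50, 5)                           # data generated under H0
        alt = if (mean(x) > 50) "greater" else "less"  # direction chosen after seeing the data
        t.test(x, mu = 50, alt = alt)$p.val <= 0.05
      })
mean(rej)   # rejection rate should land near 0.10 rather than the nominal 0.05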


Note: The fictitious data used above was sampled in R as shown below. Of course, in a real-life application the exact population parameters would never be known.

set.seed(2021)
x = rnorm(42, 52, 5)