[Math] Why does the standard deviation change from confidence intervals to hypothesis tests

statistical-inference statistics

When considering two-sample data that involves a difference of proportions, both a confidence interval and a hypothesis test can be done.

The standard deviation used for a difference of proportions in creating a confidence interval is $\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$.

However, the standard deviation used for hypothesis tests is $\sqrt{\frac{p(1 - p)}{n_1} + \frac{p(1 - p)}{n_2}}$, where $p = \frac{x_1 + x_2}{n_1 + n_2}$, $x_1 = p_1n_1$, and $x_2 = p_2n_2$.

What I don't understand is why these are different. They're both the standard deviation of the same proportion, so why should they differ?

Best Answer

Below, I will use $\hat{p_i}$ to indicate sample proportions and $p_i$ to indicate true values (population parameters). I will use $95\%$ intervals for demonstration.

The test of differences in proportions starts with the null hypothesis $p_1=p_2=p$. Under this assumption, $\hat{p_1}-\hat{p_2}$ is approximately normal with mean $0$ and variance $\frac{p(1 - p)}{n_1} + \frac{p(1 - p)}{n_2}$. Since $p$ is unknown, it is estimated by the pooled sample proportion $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$. The 95% probability interval (the interval within which, if the null hypothesis is true, the value $\hat{p_1}-\hat{p_2}$ will fall 95% of the time) is then approximately $$\hat{p_1}-\hat{p_2}\in\left[-1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n_1} + \frac{\hat{p}(1 - \hat{p})}{n_2}},\ 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n_1} + \frac{\hat{p}(1 - \hat{p})}{n_2}}\right]$$

Under the null hypothesis, the alternate formula (with $p_1,p_2$ rather than $p$) is a less efficient estimator of the standard deviation of the sample proportion difference, since it does not use the entire data set to estimate the common $p$ and instead estimates $p_1$ and $p_2$ separately.
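To make the pooled calculation concrete, here is a minimal Python sketch of the test statistic under $p_1=p_2=p$. The function name `pooled_two_proportion_z` and the counts (45 of 100 versus 30 of 100) are made up for illustration and do not come from the question.

```python
import math

def pooled_two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 == p2, using the pooled standard error.

    x1, x2 are success counts; n1, n2 are sample sizes (illustrative names).
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Under H0 both samples share one p, so estimate it from all the data.
    p_pool = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)
    z = (p1_hat - p2_hat) / se_pooled
    return z, se_pooled

# Hypothetical data: 45/100 successes in group 1, 30/100 in group 2.
z, se = pooled_two_proportion_z(x1=45, n1=100, x2=30, n2=100)
print(f"pooled SE = {se:.4f}, z = {z:.3f}, outside +/-1.96: {abs(z) > 1.96}")
```

Checking whether $|z| > 1.96$ is the same as checking whether $\hat{p_1}-\hat{p_2}$ falls outside the 95% probability interval above.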

On the other hand, the $95\%$ confidence interval is the set of all potential true values of $p_1-p_2$ for which a sample value at least as extreme as $\hat{p_1}-\hat{p_2}$ would be generated by the same sampling procedure at least $5\%$ of the time. In constructing this interval, one cannot make the assumption $p_1=p_2=p$, since asking about the potential true values of $p_1-p_2$ while already assuming that value is zero is meaningless.

Without the assumption of equivalence, our best estimate of the standard deviation of the difference is the alternate formula and the approximate 95% confidence interval is given by:

$$p_1-p_2\in\left[(\hat{p_1}-\hat{p_2})-1.96\sqrt{\frac{\hat{p_1}(1 - \hat{p_1})}{n_1} + \frac{\hat{p_2}(1 - \hat{p_2})}{n_2}},\ (\hat{p_1}-\hat{p_2})+1.96\sqrt{\frac{\hat{p_1}(1 - \hat{p_1})}{n_1} + \frac{\hat{p_2}(1 - \hat{p_2})}{n_2}}\right]$$
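As a rough illustration of the interval above, the following Python sketch computes the unpooled standard error and the resulting approximate 95% confidence interval. The function name `unpooled_diff_ci` and the counts (45 of 100 versus 30 of 100) are hypothetical, chosen only to show the arithmetic.

```python
import math

def unpooled_diff_ci(x1, n1, x2, n2, z_crit=1.96):
    """Approximate confidence interval for p1 - p2 with the unpooled SE.

    No assumption p1 == p2 is made, so each proportion is estimated
    from its own sample only.
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    se_unpooled = math.sqrt(
        p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2
    )
    diff = p1_hat - p2_hat
    return diff - z_crit * se_unpooled, diff + z_crit * se_unpooled, se_unpooled

# Hypothetical data: 45/100 successes in group 1, 30/100 in group 2.
lower, upper, se = unpooled_diff_ci(x1=45, n1=100, x2=30, n2=100)
print(f"unpooled SE = {se:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

The only change from the test is which standard error is plugged in: the interval is centered on $\hat{p_1}-\hat{p_2}$ and uses the separate estimates $\hat{p_1}$ and $\hat{p_2}$, so its width generally differs slightly from that of the pooled probability interval.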
