Hypothesis Testing Standard Error – What is the Standard Error for Distribution of Difference in Proportions?

hypothesis testing; proportion; standard error

I am looking for the standard error of the distribution of the difference in proportions, for a hypothesis test whose null hypothesis is that the two proportions differ by a constant.

Let's say that $H_0: p_1 = p_2 + c$. Then I've seen that $\sigma_{\hat{p}_1 - \hat{p}_2}=\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

Is that right, or is there another formula one should use? For example, with the same null hypothesis $H_0: p_1 = p_2 + c$, one could substitute $\hat{p}_2 + c$ for $\hat{p}_1$:
$\sigma_{\hat{p}_1 - \hat{p}_2}=\sqrt{\frac{(\hat{p}_2 + c)(1-(\hat{p}_2 + c))}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

P.S. While we are on the subject, why is the difference of proportions asymptotically normal (or where can I read about it)? I know that the single-sample test is based on the central limit theorem, since the variance is determined by the null hypothesis. But for the difference in proportions the variance is estimated, so it seems this should be a t-test. A large N would bring the t distribution close to Z, but I wonder whether there is any other reason to consider, other than that the difference of two normals is normal.

Best Answer

The proportions $p_1$ and $p_2$ you use in the variance should be the true proportions under the null hypothesis, where $p_1=p_2+c$. Obviously you don't know what these are (you only know $c$), so you have to estimate them from your data. Either of the formulae you have gives a reasonable estimate, but it is possible to combine them.

Start with the obvious estimates $\hat{p}_1=\frac{X_1}{n_1}$ and $\hat{p}_2=\frac{X_2}{n_2}$. On their own these won't do, because they are unlikely to differ by exactly $c$. An alternative estimate of $p_1$ is $\frac{X_2}{n_2}+c=\hat{p}_2+c$. You might then construct a weighted average of the two estimates of $p_1$:

$\hat{p}_{1b}=\frac{n_2(\hat{p}_2+c)+n_1\hat{p}_1}{n_1+n_2}$

which reduces to:

$\hat{p}_{1b}=\frac{X_2+n_{2}c+X_1}{n_1+n_2}$

I'd then use that in the formula to estimate the standard error of the distribution:

$\sigma_{\hat{p}_{1b} - \hat{p}_2}=\sqrt{\frac{\hat{p}_{1b}(1-\hat{p}_{1b})}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
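To make the arithmetic concrete, here is a minimal Python sketch of the weighted estimate and the resulting standard error, followed by the usual z statistic for $H_0: p_1 = p_2 + c$. The function names and the counts are hypothetical, chosen only for illustration, and the z statistic is the standard large-sample form rather than anything specific to this answer.

```python
import math

def pooled_null_se(x1, n1, x2, n2, c):
    """Standard error of p1_hat - p2_hat using the weighted estimate p1b."""
    p2_hat = x2 / n2
    # Weighted average of the two estimates of p1: x1/n1 and x2/n2 + c.
    p1b = (x2 + n2 * c + x1) / (n1 + n2)
    return math.sqrt(p1b * (1 - p1b) / n1 + p2_hat * (1 - p2_hat) / n2)

def z_statistic(x1, n1, x2, n2, c):
    """Large-sample z statistic for H0: p1 = p2 + c."""
    observed_diff = x1 / n1 - x2 / n2
    return (observed_diff - c) / pooled_null_se(x1, n1, x2, n2, c)

# Hypothetical counts, not taken from any real data set.
x1, n1, x2, n2, c = 60, 100, 45, 100, 0.10
se = pooled_null_se(x1, n1, x2, n2, c)
z = z_statistic(x1, n1, x2, n2, c)
print(f"SE = {se:.4f}, z = {z:.3f}")
```

Note that this sketch does not check that the weighted estimate stays inside $[0, 1]$, which could fail for extreme values of $c$.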

In practice, my experience is that the choice among these estimates makes little difference to the result.
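One way to get a feel for how close the candidates are is to evaluate all three standard errors on the same counts. The sketch below does this in Python for one set of made-up counts; it is a single illustration under those assumed numbers, not evidence that the gap is always small, and the variable names are hypothetical.

```python
import math

def se_diff(p1, n1, p2, n2):
    """Standard error of a difference of two independent sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Hypothetical counts and null offset, chosen only for illustration.
x1, n1, x2, n2, c = 60, 100, 45, 100, 0.10
p1_hat, p2_hat = x1 / n1, x2 / n2
p1b = (x2 + n2 * c + x1) / (n1 + n2)  # weighted estimate of p1

se_unrestricted = se_diff(p1_hat, n1, p2_hat, n2)   # plug in both sample proportions
se_shifted = se_diff(p2_hat + c, n1, p2_hat, n2)    # replace p1_hat by p2_hat + c
se_weighted = se_diff(p1b, n1, p2_hat, n2)          # use the weighted estimate

for name, value in [("unrestricted", se_unrestricted),
                    ("shifted", se_shifted),
                    ("weighted", se_weighted)]:
    print(f"{name:>12}: {value:.5f}")
```

For these counts the three values agree to within about 0.001; with small samples or proportions near 0 or 1 the gap may be larger.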

With regard to your postscript, the reason the difference is asymptotically normal is, as you guess, that a linear combination of two independent normally distributed random variables is also normally distributed (and the same holds when the normality is only asymptotic).
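A quick simulation can serve as a sanity check on the normal approximation. The Python sketch below draws two independent binomial samples many times, standardizes the difference in sample proportions with its true standard error, and reports how often the result lands within $\pm 1.96$. The proportions, sample sizes, replication count and seed are all assumptions picked for illustration; if the approximation is good, the share should be close to 0.95.

```python
import math
import random

def simulate_coverage(p1, n1, p2, n2, reps=2000, seed=0):
    """Share of standardized differences in proportions with |z| <= 1.96."""
    rng = random.Random(seed)
    true_se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    inside = 0
    for _ in range(reps):
        # Each count is a sum of independent Bernoulli draws (a binomial sample).
        x1 = sum(rng.random() < p1 for _ in range(n1))
        x2 = sum(rng.random() < p2 for _ in range(n2))
        z = ((x1 / n1 - x2 / n2) - (p1 - p2)) / true_se
        if abs(z) <= 1.96:
            inside += 1
    return inside / reps

# Hypothetical true proportions and sample sizes.
coverage = simulate_coverage(0.55, 200, 0.45, 200)
print(f"Share of |z| <= 1.96: {coverage:.3f}")
```

The sketch standardizes with the true standard error, so it only checks the normality of the difference itself and says nothing about the extra variability from estimating the variance.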
