[Math] Statistics – Confidence Interval for a Population Proportion Formula

statistics

I am having a conceptual hard time understanding where this formula came from. It does not seem to make any sense to me. Could someone shed some light on this:

"The natural estimator p is $\hat p = \frac{X}{n}$, the same fraction of success. Since $\hat p$ is just $X$ multiplied by a constant, $\hat p$ has an approximately normal distribution. $E(\hat p) = p$ and $\sigma_{\hat p} = \sqrt{\frac{p(1-p)}{n}}$. Standardizing, this implies that:

$$P\left(-z_{\alpha/2} < \frac{\hat p- p}{\sqrt{p(1-p)/n}} < z_{\alpha/2}\right) \approx 1 – \alpha $$"

Could someone derive this equation so that it makes sense to someone who's never seen this before?

Best Answer

Here $p$ is the probability of a success in a single trial, $n$ is the number of trials, and $X$ is a random variable representing the number of successes in a run of $n$ trials. On average you would expect to get $pn$ successes, expected value of $X$ is $pn$. The expected value of $\frac{X}n$ is therefore $\frac{pn}n=p$. If we define $\hat p$ to be $\frac{X}n$, the fraction of trials that are successes, we expect it to be about $p$, give or take a bit.

That ‘give or take a bit’ is measured by the standard deviation $\sigma_{\hat p}=\sqrt{\frac{p(1-p)}n}$. Now when $n$ is reasonably large, both $X$, the number of successes, and $\hat p$, the fraction of successes, are approximately normally distributed. The mean of the normal distribution that approximates $\hat p$ is $p$, the expected value of $\hat p$, and its standard deviation is $\sigma_{\hat p}$.

The standard normal distribution, however, has a mean of $0$ and a standard deviation of $1$, so in order to use the standard tables prepared from it, you have to shift and rescale the distribution of $\hat p$ so that it has a mean of $0$ and a standard deviation of $1$. The formula does this in two steps. First it replaces $\hat p$ by $\hat p-p$. Thus, if you get a sample that hits the expected fraction of successes on the nose, your $\hat p$ will equal $p$, and $\hat p-p$ will be $0$. More generally, $\hat p-p$ measures measures the deviation of your sample’s fraction of successes from the expected fraction; when $\hat p-p>0$, you got more than the average number of successes, and when $\hat p-p<0$, you got fewer. Subtracting $p$ from $\hat p$ just shifts the centre of the distribution from $p$ down to $0$: since the mean value of $\hat p$ is $p$, the mean value of $\hat p-p$ is $0$.

Then we want to rescale the distribution to standardize its spread, so that it has a standard deviation of $1$. Right now its standard deviation is still $\sigma_{\hat p}$: shifting it down by $p$ units doesn’t change its shape, or how must it’s spread out. We want to convert each $\sigma_{\hat p}$ units of spread to $1$ unit. This is like wanting to convert a spread in feet to a spread in yards: you have to divide by the $3$ feet per yard. Here I have to divide by the $\sigma_{\hat p}$ $\hat p$-units per standard unit; the result is

$$\frac{\hat p-p}{\sigma_{\hat p}}=\frac{\hat p-p}{\sqrt{p(1-p)/n}}\;.$$

For convenience let’s call this quantity $Y$. $Y$ is $\hat p$ shifted down by $p$ units and rescaled to have a standard deviation of $1$. Since $\hat p$ is approximately normally distributed, so is $Y$, and its mean and standard deviation are $0$ and $1$, respectively. This means that $Y$ is approximated by the standard normal distribution.

Suppose that $Z$ is a random variable with the standard normal distribution. If $0<\alpha\le\frac12$, $z_\alpha$ is by definition the number with the property that $P(Z\ge z_\alpha)=\alpha$. Thus, $P(Z\ge z_{\alpha/2})=\alpha/2$. The standard normal distribution is symmetric about its mean $0$, so $P(Z\le -z_{\alpha/2})=\alpha/2$ as well. Thus, $$P(Z\le -z_{\alpha/2}\text{ or }Z\ge z_{\alpha/2})=\frac{\alpha}2+\frac{\alpha}2=\alpha\;,$$ and hence $$P(-z_{\alpha/2}<Z<z_{\alpha/2})=1-\alpha\;:$$ the probability that $Z$ falls between $-z_{\alpha/2}$ and $z_{\alpha/2}$ is $1-\alpha$.

Finally, $Y$ is distributed approximately like $Z$, with the same mean and standard deviation, so the same is approximately true of $Y$:

$$P(-z_{\alpha/2}<Y<z_{\alpha/2})\approx 1-\alpha\;.\tag{1}$$

And since $$Y=\frac{\hat p-p}{\sigma_{\hat p}}=\frac{\hat p-p}{\sqrt{p(1-p)/n}}\;,$$ $(1)$ reduces to

$$P\left(-z_{\alpha/2} < \frac{\hat p- p}{\sqrt{p(1-p)/n}} < z_{\alpha/2}\right) \approx 1 - \alpha\;.$$

Related Question