Here's a commonly seen derivation.
Consider throwing a dart at a dartboard, aiming at the origin. We make the assumptions that:
- The errors in the horizontal and vertical directions are independent and identically distributed
- Errors are isotropic, i.e. the magnitude of the error doesn't depend on direction
- The chance of the dart landing in a small area is proportional to the area
- Large errors are less likely than small errors
Say that the probability of landing in a strip centered on $x$ and of width $\Delta x$ is $p(x)\Delta x$, and similarly $p(y)\Delta y$ for landing in a strip centered at $y$ of width $\Delta y$.
Since horizontal and vertical errors are independent we can multiply these probabilities to get the probability of landing in a box at $(x,y)$ of size $\Delta x\Delta y$. By assumptions (1) and (2) this should only depend on the distance from the origin, and should also be proportional to $\Delta x\Delta y$:
$$p(x)\Delta x \cdot p(y)\Delta y = f(r) \Delta x \Delta y$$
which tells us that
$$p(x)p(y) = f(r)$$
If we differentiate with respect to the angular coordinate $\theta$ on both sides, we get
$$p(x) \frac{\partial p(y)}{\partial \theta} + p(y) \frac{\partial p(x)}{\partial \theta} = 0$$
which, using polar coordinates $x=r\cos\theta$ and $y=r\sin\theta$ becomes
$$p(x)p'(y) r\cos\theta - p(y)p'(x) r\sin\theta = 0$$
or
$$p(x)p'(y)x - p(y)p'(x) y = 0$$
which can be rearranged to
$$\frac{p(x)x}{p'(x)} = \frac{p(y)y}{p'(y)}$$
Since both sides are functions of an independent variable and yet are equal, they must be equal to a constant:
$$xp(x) = Cp'(x)$$
and we can now separate variables and integrate, getting
$$\frac{x^2}{2} = C\log p(x) + b$$
or
$$p(x) = A \exp\left(\frac{x^2}{2C} \right)$$
Now the assumption that large errors are less likely than small tells us that $C<0$, and the constant $A$ is determined by the requirement that the probability integrates to 1.
Best Answer
Here $p$ is the probability of a success in a single trial, $n$ is the number of trials, and $X$ is a random variable representing the number of successes in a run of $n$ trials. On average you would expect to get $pn$ successes, expected value of $X$ is $pn$. The expected value of $\frac{X}n$ is therefore $\frac{pn}n=p$. If we define $\hat p$ to be $\frac{X}n$, the fraction of trials that are successes, we expect it to be about $p$, give or take a bit.
That ‘give or take a bit’ is measured by the standard deviation $\sigma_{\hat p}=\sqrt{\frac{p(1-p)}n}$. Now when $n$ is reasonably large, both $X$, the number of successes, and $\hat p$, the fraction of successes, are approximately normally distributed. The mean of the normal distribution that approximates $\hat p$ is $p$, the expected value of $\hat p$, and its standard deviation is $\sigma_{\hat p}$.
The standard normal distribution, however, has a mean of $0$ and a standard deviation of $1$, so in order to use the standard tables prepared from it, you have to shift and rescale the distribution of $\hat p$ so that it has a mean of $0$ and a standard deviation of $1$. The formula does this in two steps. First it replaces $\hat p$ by $\hat p-p$. Thus, if you get a sample that hits the expected fraction of successes on the nose, your $\hat p$ will equal $p$, and $\hat p-p$ will be $0$. More generally, $\hat p-p$ measures measures the deviation of your sample’s fraction of successes from the expected fraction; when $\hat p-p>0$, you got more than the average number of successes, and when $\hat p-p<0$, you got fewer. Subtracting $p$ from $\hat p$ just shifts the centre of the distribution from $p$ down to $0$: since the mean value of $\hat p$ is $p$, the mean value of $\hat p-p$ is $0$.
Then we want to rescale the distribution to standardize its spread, so that it has a standard deviation of $1$. Right now its standard deviation is still $\sigma_{\hat p}$: shifting it down by $p$ units doesn’t change its shape, or how must it’s spread out. We want to convert each $\sigma_{\hat p}$ units of spread to $1$ unit. This is like wanting to convert a spread in feet to a spread in yards: you have to divide by the $3$ feet per yard. Here I have to divide by the $\sigma_{\hat p}$ $\hat p$-units per standard unit; the result is
$$\frac{\hat p-p}{\sigma_{\hat p}}=\frac{\hat p-p}{\sqrt{p(1-p)/n}}\;.$$
For convenience let’s call this quantity $Y$. $Y$ is $\hat p$ shifted down by $p$ units and rescaled to have a standard deviation of $1$. Since $\hat p$ is approximately normally distributed, so is $Y$, and its mean and standard deviation are $0$ and $1$, respectively. This means that $Y$ is approximated by the standard normal distribution.
Suppose that $Z$ is a random variable with the standard normal distribution. If $0<\alpha\le\frac12$, $z_\alpha$ is by definition the number with the property that $P(Z\ge z_\alpha)=\alpha$. Thus, $P(Z\ge z_{\alpha/2})=\alpha/2$. The standard normal distribution is symmetric about its mean $0$, so $P(Z\le -z_{\alpha/2})=\alpha/2$ as well. Thus, $$P(Z\le -z_{\alpha/2}\text{ or }Z\ge z_{\alpha/2})=\frac{\alpha}2+\frac{\alpha}2=\alpha\;,$$ and hence $$P(-z_{\alpha/2}<Z<z_{\alpha/2})=1-\alpha\;:$$ the probability that $Z$ falls between $-z_{\alpha/2}$ and $z_{\alpha/2}$ is $1-\alpha$.
Finally, $Y$ is distributed approximately like $Z$, with the same mean and standard deviation, so the same is approximately true of $Y$:
$$P(-z_{\alpha/2}<Y<z_{\alpha/2})\approx 1-\alpha\;.\tag{1}$$
And since $$Y=\frac{\hat p-p}{\sigma_{\hat p}}=\frac{\hat p-p}{\sqrt{p(1-p)/n}}\;,$$ $(1)$ reduces to
$$P\left(-z_{\alpha/2} < \frac{\hat p- p}{\sqrt{p(1-p)/n}} < z_{\alpha/2}\right) \approx 1 - \alpha\;.$$