Why is the normal probability curve used to approximate the binomial probability distribution?

binomial distribution, normal distribution

Background: I'm a psychology/behavioural science student. I'm trying to teach myself some stats stuff which goes beyond the scope of my current syllabus.

Question: Quoting from Chapter 6: The Normal Probability Distribution, Introduction to Probability and Statistics by Mendenhall, Beaver and Beaver (14th Ed.),

Since the normal distribution is continuous, the area under the curve at any single point is equal to $0$. Keep in mind that this result applies only to continuous random variables. Because the binomial random variable $x$ is a discrete random variable, the probability that $x$ takes some specific value, say $x = 11$, will not necessarily equal $0$.

As far as my understanding goes, the normal probability distribution is used for continuous random variables (as also stated above), so why is it being used for approximating binomial probability distributions, which are discrete random variables? How is this approximation justified when a discrete random variable is capable of taking a certain value with a specific probability, but for a continuous random variable, the probability of it taking a specific value is $0$?

Extra: Kindly suggest corrections for the above question in case of erroneous statements.

Best Answer

This is interesting. We have many questions here asking about details of the steps of a method to use the normal distribution to approximate the binomial distribution. Few if any, however, ask what makes this method a valid method in the first place.

It is true that when you look closely, a binomial distribution and a normal distribution are quite different objects. The binomial distribution has a probability mass function rather than a density: all the probability is concentrated at a finite number of points.

But let's try a different representation of the binomial variable $B$. Instead of actually plotting the function, take each possible outcome $k$ of the distribution and construct a rectangle of width $1$ and height equal to the probability of that outcome, $p_B(k).$ Then put that rectangle upright on the $x$ axis of a graph, so that the centerline of the rectangle lies on the line $x = k.$ When you do this, you get something like the colored rectangles in the figure below:

[Figure: binomial probabilities for $6$ trials drawn as rectangles of width $1$, with a normal density superimposed.]

Looking at a graph like this, you might notice that the rectangles derived from the binomial distribution look a lot like a Riemann sum of a normal distribution. In the figure above you can see that they come close to being a "midpoint" Riemann sum of the superimposed normal density. The middle bar is just a little too short, and if you look closely the other bars are not quite the right height either. But this is just a simple example for illustration. If the binomial distribution represented a much larger number of trials, for example $100$ trials instead of just $6,$ the rectangles would be a much better approximation of a "midpoint" Riemann sum, as long as you don't look too far into the "tails" of the normal distribution (where the binomial probability will be zero although the normal density remains positive).

The observation that makes the normal approximation work is that if you take some sequence of adjacent rectangles, they are in fact a kind of Riemann sum, not exactly the "midpoint" sum but still a relatively accurate one, approximating the area under the normal distribution between the leftmost edge of the leftmost rectangle and the rightmost edge of the rightmost rectangle. And an approximation that works in one direction works just as well in the other: the area under the normal distribution is a good approximation of the sum of the areas of the rectangles, which is the sum of probabilities of a range of outcomes of the binomial.

For example, consider a binomial variable $X$ with probability $p = \frac12$ for each trial and with $n = 30$ trials, and suppose we want the probability that $7\leq X \leq 9.$ We construct rectangles for $P(X=7),$ $P(X=8),$ and $P(X=9).$ Those rectangles lie between the lines $x = 6.5$ and $x = 9.5.$ If $f_N$ is the density of a normal distribution with the same mean and variance as the binomial, the rectangles provide an approximation of the area under the normal distribution between those lines:

\begin{multline} \int_{6.5}^{9.5} f_N(t)\,dt \approx (7.5 - 6.5) f_N(7) + (8.5 - 7.5) f_N(8) + (9.5 - 8.5) f_N(9) \\ = f_N(7) + f_N(8) + f_N(9). \end{multline}
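As a rough numerical check of this example, the sketch below compares three quantities for $n = 30,$ $p = \frac12$: the exact binomial sum $P(X=7)+P(X=8)+P(X=9),$ the midpoint sum $f_N(7)+f_N(8)+f_N(9),$ and the normal area between $6.5$ and $9.5.$ The helper names (`binom_pmf`, `normal_pdf`, `normal_cdf`) are made up for this illustration, and only Python's standard library is assumed. The three numbers should land close to one another (all roughly $0.02$), even though this range is fairly far out in the tail.

```python
from math import comb, erf, exp, pi, sqrt

n, p = 30, 0.5
mu = n * p                      # mean of the binomial: 15
sigma = sqrt(n * p * (1 - p))   # standard deviation: sqrt(7.5)

def binom_pmf(k):
    """Exact binomial probability P(X = k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(t):
    """Density f_N of the normal with the same mean and variance."""
    return exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def normal_cdf(t):
    """CDF of that same normal, written with the error function."""
    return 0.5 * (1 + erf((t - mu) / (sigma * sqrt(2))))

ks = range(7, 10)                               # the outcomes 7, 8, 9
exact = sum(binom_pmf(k) for k in ks)           # total area of the three rectangles
midpoint_sum = sum(normal_pdf(k) for k in ks)   # f_N(7) + f_N(8) + f_N(9)
integral = normal_cdf(9.5) - normal_cdf(6.5)    # area under f_N on [6.5, 9.5]

print(f"exact binomial sum  : {exact:.5f}")
print(f"midpoint Riemann sum: {midpoint_sum:.5f}")
print(f"normal area         : {integral:.5f}")
```

The rectangle heights $P(X=k)$ are not exactly $f_N(k),$ which is why the first two numbers differ slightly; the third differs from the second only by the usual Riemann-sum error.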

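Note: Actually proving that this approximation is a good one, rather than appealing to graphical intuition, is part of one of the most important theorems of mathematical probability: the central limit theorem, whose special case for the binomial distribution is the de Moivre-Laplace theorem.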

As already noted, the approximation is far from perfect. It is not good for a small number of trials, and it is not good in the "tails" of the normal distribution. It also tends not to be as good when the single-trial probability $p$ of the binomial is very close to $0$ or $1$ as it is when $p \approx \frac12.$ These issues are discussed in the questions "Normal approximation to the binomial distribution" and "Normal approximation of binomial distribution - limits," among other places.
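To get a feel for these limitations, here is a small sketch that measures the worst-case error of the continuity-corrected approximation $P(X \le k) \approx \Phi\bigl((k + 0.5 - \mu)/\sigma\bigr)$ over all $k.$ The function names and the three $(n, p)$ pairs are choices made for this illustration only, and using the largest CDF error as the yardstick is an assumption; other measures (such as relative error in the tails) would tell a somewhat different story.

```python
from math import comb, erf, sqrt

def binom_cdf_values(n, p):
    """Exact P(X <= k) for k = 0..n, built by accumulating the PMF."""
    total, values = 0.0, []
    for k in range(n + 1):
        total += comb(n, k) * p**k * (1 - p)**(n - k)
        values.append(total)
    return values

def normal_cdf(t, mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1 + erf((t - mu) / (sigma * sqrt(2))))

def max_cdf_error(n, p):
    """Largest gap between the exact binomial CDF and the
    continuity-corrected normal CDF, taken over k = 0..n."""
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    exact = binom_cdf_values(n, p)
    return max(abs(exact[k] - normal_cdf(k + 0.5, mu, sigma))
               for k in range(n + 1))

# Few trials, many trials with p = 1/2, and many trials with p near 0.
for n, p in [(6, 0.5), (100, 0.5), (100, 0.02)]:
    print(f"n={n:3d}, p={p:4.2f}: max CDF error = {max_cdf_error(n, p):.5f}")
```

With these inputs the error for $n = 100,$ $p = \frac12$ should come out smaller than for either $n = 6$ or $p = 0.02,$ in line with the remarks above.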

Because of these issues, the normal approximation is not recommended if you want to know the probability that the binomial variable will take its smallest value, or even one of its three smallest values. We might use it for estimating the probability of a range of several outcomes nearer the middle of the binomial distribution, or the probability that the outcome is no greater than $k,$ provided $k$ is not too close to the minimum or maximum value.

Related Solutions

Normal Distribution Model for Discrete Probability – Is It Possible?

Your ultimate goal is not clear. Perhaps I can flounder around and make some useful comments.

For appropriate choices of $n$ and $\theta,$ the distribution $Binom(n, \theta)$ is approximately normal, especially if $n$ is large and $\theta$ is not too far from 1/2. The mean is $\mu = n\theta$ and the variance is $\sigma^2 = n\theta(1-\theta).$

Also, for large enough $\lambda,$ the distribution $Pois(\lambda)$ is nearly normal. The mean and variance are $\mu = \lambda$ and $\sigma^2 = \lambda.$ However, the Poisson model may have less flexibility in matching what you want.
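As an illustrative check of the Poisson claim, the sketch below compares the exact Poisson CDF with a normal CDF of mean $\lambda$ and variance $\lambda,$ using a continuity correction. The value $\lambda = 100$ and the function names are assumptions chosen for this example, not part of the original answer.

```python
from math import erf, exp, lgamma, log, sqrt

def poisson_cdf(k, lam):
    """Exact P(X <= k) for X ~ Pois(lam); each term is computed on the
    log scale to avoid overflow in lam**j and j!."""
    return sum(exp(-lam + j * log(lam) - lgamma(j + 1)) for j in range(k + 1))

def normal_cdf(t, mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1 + erf((t - mu) / (sigma * sqrt(2))))

lam = 100.0
for k in (90, 100, 110):
    exact = poisson_cdf(k, lam)
    approx = normal_cdf(k + 0.5, lam, sqrt(lam))   # continuity correction
    print(f"k={k}: exact={exact:.4f}, normal approx={approx:.4f}")
```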

Of course, to find the probability that a random variable taking integer values lies in an interval $(a, b]$ you will add probabilities for integer values in that interval, rather than evaluating an integral.

For example, if $X \sim Binom(n = 100, \theta = 1/2),$ you have $\mu = 50$ and $\sigma = 5.$ Perhaps you want

$$\begin{aligned} P(48 < X \le 52) &= P(X = 49) + P(X = 50) + P(X = 51) + P(X = 52)\\ &= P(X \le 52) - P(X \le 48) = F_X(52) - F_X(48) = 0.3091736, \end{aligned}$$ where $F_X(\cdot)$ is the CDF of $X.$

If there are many integers in the desired interval, computation by hand can be tedious. In R statistical software, dbinom gives the binomial PDF (more precisely, the probability mass function) and pbinom gives the binomial CDF.

The probability above could be evaluated in R as shown below. [The last value is a normal approximation (with continuity correction), which is often accurate to a couple of decimal places.]

 sum(dbinom(49:52, 100, .5))      # adding terms of the PDF
 ## 0.3091736
 diff(pbinom(c(48,52), 100, .5))  # subtracting two CDF values
 ## 0.3091736
 diff(pnorm(c(48.5,52.5), 50, 5)) # normal approximation
 ## 0.3093739

The figure below shows several values of the PDF of $Binom(100, .5),$ emphasizes the four probabilities required (heights of thick blue bars), and shows the approximating normal density curve. The normal approximation is the area beneath the curve between the vertical green lines.

Probability at specific point in normal distribution curve

There are a number of ways to look at this.

First, intuitively, the "size" of a single point in comparison to the real line is negligible. As an analogy, consider a population of size $10^{500}$. If we choose one element uniformly at random, then that single element occurs with probability $10^{-500} \approx 0$. The real line is "infinitely bigger" than a population of size $10^{500}$.

Formally, this is a consequence of the integral definition. For a continuous random variable $X$ with probability density function $f$, we define $$P(a \leq X \leq b) = \int_a^b f(x)\ dx.$$ If we only care about a single point, then we get $$P(X = a) = P(a \leq X \leq a) = \int_a^a f(x)\ dx = 0.$$ This holds for any continuous random variable, and normally distributed ones in particular.
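The integral argument can also be seen numerically. The sketch below, which assumes a standard normal variable and an arbitrarily chosen point $a = 1,$ computes $P(a - \varepsilon \le X \le a + \varepsilon)$ for shrinking $\varepsilon.$ The probabilities shrink roughly in proportion to the interval width $2\varepsilon,$ so the limit as the interval collapses to the single point is $0.$

```python
from math import erf, sqrt

def normal_cdf(t, mu=0.0, sigma=1.0):
    """CDF of a normal distribution (standard normal by default)."""
    return 0.5 * (1 + erf((t - mu) / (sigma * sqrt(2))))

a = 1.0                                  # the single point of interest
widths = (1e-1, 1e-2, 1e-4, 1e-8)        # half-widths of the shrinking interval
probs = [normal_cdf(a + eps) - normal_cdf(a - eps) for eps in widths]

for eps, prob in zip(widths, probs):
    print(f"eps={eps:.0e}: P(a - eps <= X <= a + eps) = {prob:.3e}")
```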

If you want to get really technical, then this is a consequence of the fact that finite sets have Lebesgue measure zero. In a sense, sets like $\{a\}$ are simply too small to integrate over without getting $0$.
