# What intuitive explanation is there for the central limit theorem?


In several different contexts we invoke the central limit theorem to justify whatever statistical method we want to adopt (e.g., approximate the binomial distribution by a normal distribution). I understand the technical details as to why the theorem is true but it just now occurred to me that I do not really understand the intuition behind the central limit theorem.

So, what is the intuition behind the central limit theorem?

Layman explanations would be ideal. If some technical detail is needed, please assume that I understand the concepts of a pdf, cdf, random variable, etc., but have no knowledge of convergence concepts, characteristic functions, or anything to do with measure theory.

I apologize in advance for the length of this post: it is with some trepidation that I let it out in public at all, because it takes some time and attention to read through and undoubtedly has typographic errors and expository lapses. But here it is for those who are interested in the fascinating topic, offered in the hope that it will encourage you to identify one or more of the many parts of the CLT for further elaboration in responses of your own.

Most attempts at "explaining" the CLT are illustrations or just restatements that assert it is true. A really penetrating, correct explanation would have to explain an awful lot of things.

Before looking at this further, let's be clear about what the CLT says. As you all know, there are versions that vary in their generality. The common context is a sequence of random variables, which are certain kinds of functions on a common probability space. For intuitive explanations that hold up rigorously I find it helpful to think of a probability space as a box with distinguishable objects. It doesn't matter what those objects are but I will call them "tickets." We make one "observation" of a box by thoroughly mixing up the tickets and drawing one out; that ticket constitutes the observation. After recording it for later analysis we return the ticket to the box so that its contents remain unchanged. A "random variable" basically is a number written on each ticket.

In 1733, Abraham de Moivre considered the case of a single box where the numbers on the tickets are only zeros and ones ("Bernoulli trials"), with some of each number present. He imagined making $$n$$ physically independent observations, yielding a sequence of values $$x_1, x_2, \ldots, x_n$$, all of which are zero or one. The sum of those values, $$y_n = x_1 + x_2 + \ldots + x_n$$, is random because the terms in the sum are. Therefore, if we could repeat this procedure many times, various sums (whole numbers ranging from $$0$$ through $$n$$) would appear with various frequencies--proportions of the total. (See the histograms below.)

Now one would expect--and it's true--that for very large values of $$n$$, all the frequencies would be quite small. If we were to be so bold (or foolish) as to attempt to "take a limit" or "let $$n$$ go to $$\infty$$", we would conclude correctly that all frequencies reduce to $$0$$. But if we simply draw a histogram of the frequencies, without paying any attention to how its axes are labeled, we see that the histograms for large $$n$$ all begin to look the same: in some sense, these histograms approach a limit even though the frequencies themselves all go to zero.

These histograms depict the results of repeating the procedure of obtaining $$y_n$$ many times. $$n$$ is the "number of trials" in the titles.

The insight here is to draw the histogram first and label its axes later. With large $$n$$ the histogram covers a large range of values centered around $$n/2$$ (on the horizontal axis) and a vanishingly small interval of values (on the vertical axis), because the individual frequencies grow quite small. Fitting this curve into the plotting region has therefore required both a shifting and rescaling of the histogram. The mathematical description of this is that for each $$n$$ we can choose some central value $$m_n$$ (not necessarily unique!) to position the histogram and some scale value $$s_n$$ (not necessarily unique!) to make it fit within the axes. This can be done mathematically by changing $$y_n$$ to $$z_n = (y_n - m_n) / s_n$$.

Remember that a histogram represents frequencies by areas between it and the horizontal axis. The eventual stability of these histograms for large values of $$n$$ should therefore be stated in terms of area. So, pick any interval of values you like, say from $$a$$ to $$b \gt a$$ and, as $$n$$ increases, track the area of the part of the histogram of $$z_n$$ that horizontally spans the interval $$(a, b]$$. The CLT asserts several things:

1. No matter what $$a$$ and $$b$$ are, if we choose the sequences $$m_n$$ and $$s_n$$ appropriately (in a way that does not depend on $$a$$ or $$b$$ at all), this area indeed approaches a limit as $$n$$ gets large.

2. The sequences $$m_n$$ and $$s_n$$ can be chosen in a way that depends only on $$n$$, the average of values in the box, and some measure of spread of those values--but on nothing else--so that regardless of what is in the box, the limit is always the same. (This universality property is amazing.)

3. Specifically, that limiting area is the area under the curve $$y = \exp(-z^2/2) / \sqrt{2 \pi}$$ between $$a$$ and $$b$$: this is the formula of that universal limiting histogram.
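To make these three assertions concrete, here is a minimal sketch (not part of the original argument) that computes the histogram area exactly for a zero-one box and compares it with the area under the limiting curve. It assumes the usual choices $$m_n = np$$ and $$s_n = \sqrt{np(1-p)}$$, where $$p$$ is the proportion of ones in the box; the function names and the particular values of $$p$$, $$a$$, and $$b$$ are illustrative only.

```python
import math

def log_binomial_pmf(n, p, y):
    """Log of the chance of exactly y ones in n draws (log space avoids overflow)."""
    return (math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
            + y * math.log(p) + (n - y) * math.log(1 - p))

def histogram_area(n, p, a, b):
    """Exact area of the histogram of z_n = (y_n - m_n) / s_n over (a, b]."""
    m_n = n * p                        # assumed centering: expected sum
    s_n = math.sqrt(n * p * (1 - p))   # assumed scaling: SD of the sum
    area = 0.0
    for y in range(n + 1):
        z = (y - m_n) / s_n
        if a < z <= b:
            area += math.exp(log_binomial_pmf(n, p, y))
    return area

def normal_area(a, b):
    """Area under exp(-z^2/2) / sqrt(2 pi) between a and b."""
    return 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))

if __name__ == "__main__":
    p, a, b = 0.3, -1.0, 0.5           # illustrative values
    for n in (10, 100, 1000, 10000):
        print(n, histogram_area(n, p, a, b))
    print("limit", normal_area(a, b))
```

Changing `p`, `a`, or `b` changes the numbers but, if the theorem is doing its job, not the fact that the areas settle down toward the value on the last line.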

The first generalization of the CLT adds,

4. When the box can contain numbers in addition to zeros and ones, exactly the same conclusions hold (provided that the proportions of extremely large or small numbers in the box are not "too great," a criterion that has a precise and simple quantitative statement).

The next generalization, and perhaps the most amazing one, replaces this single box of tickets with an ordered indefinitely long array of boxes with tickets. Each box can have different numbers on its tickets in different proportions. The observation $$x_1$$ is made by drawing a ticket from the first box, $$x_2$$ comes from the second box, and so on.

5. Exactly the same conclusions hold provided the contents of the boxes are "not too different" (there are several precise, but different, quantitative characterizations of what "not too different" has to mean; they allow an astonishing amount of latitude).
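The two generalizations can be explored numerically in the same spirit. The sketch below is only an illustration under stated assumptions: each box is represented as a hypothetical dictionary mapping integer ticket values to their proportions, $$m_n$$ is taken to be the sum of the box means, and $$s_n$$ the square root of the sum of the box variances. It builds the exact distribution of the sum by repeated convolution and measures the area of the standardized histogram over $$(a, b]$$.

```python
import math
from collections import defaultdict

def box_mean_var(box):
    """Mean and variance of a box given as {ticket value: proportion}."""
    mean = sum(v * p for v, p in box.items())
    var = sum((v - mean) ** 2 * p for v, p in box.items())
    return mean, var

def sum_distribution(boxes):
    """Exact distribution of x_1 + ... + x_n, one independent draw per box."""
    dist = {0: 1.0}
    for box in boxes:
        new = defaultdict(float)
        for s, ps in dist.items():
            for v, pv in box.items():
                new[s + v] += ps * pv
        dist = dict(new)
    return dist

def standardized_area(boxes, a, b):
    """Area of the histogram of z_n = (y_n - m_n) / s_n over (a, b]."""
    stats = [box_mean_var(box) for box in boxes]
    m_n = sum(mean for mean, _ in stats)            # assumed centering
    s_n = math.sqrt(sum(var for _, var in stats))   # assumed scaling
    dist = sum_distribution(boxes)
    return sum(p for s, p in dist.items() if a < (s - m_n) / s_n <= b)

def normal_area(a, b):
    return 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))

if __name__ == "__main__":
    # Two made-up boxes with quite different contents, used alternately.
    box_a = {0: 0.5, 1: 0.3, 5: 0.2}
    box_b = {-2: 0.25, 0: 0.25, 3: 0.5}
    for pairs in (5, 50, 200):
        print(2 * pairs, standardized_area([box_a, box_b] * pairs, -1.0, 0.5))
    print("limit", normal_area(-1.0, 0.5))
```

Passing a list that repeats a single box reproduces the single-box generalization; mixing boxes, as in the usage above, corresponds to the array-of-boxes version.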

These five assertions, at a minimum, need explaining. There's more. Several intriguing aspects of the setup are implicit in all the statements. For example,

• What is special about the sum? Why don't we have central limit theorems for other mathematical combinations of numbers such as their product or their maximum? (It turns out we do, but they are not quite so general nor do they always have such a clean, simple conclusion unless they can be reduced to the CLT.)

• The sequences $$m_n$$ and $$s_n$$ are not unique, but they're almost unique in the sense that eventually they have to approximate the expectation of the sum of $$n$$ tickets and the standard deviation of the sum, respectively (which, in the first two statements of the CLT, equals $$\sqrt{n}$$ times the standard deviation of the box). The standard deviation is one measure of the spread of values, but it is by no means the only one nor is it the most "natural," either historically or for many applications. (Many people would choose something like a median absolute deviation from the median, for instance.) Why does the SD appear in such an essential way?

• Consider the formula for the limiting histogram: who would have expected it to take such a form? It says the logarithm of the probability density is a quadratic function. Why? Is there some intuitive or clear, compelling explanation for this?

I confess I am unable to reach the ultimate goal of supplying answers that are simple enough to meet Srikant's challenging criteria for intuitiveness and simplicity, but I have sketched this background in the hope that others might be inspired to fill in some of the many gaps. I think a good demonstration will ultimately have to rely on an elementary analysis of how values between $$\alpha_n = a s_n + m_n$$ and $$\beta_n = b s_n + m_n$$ can arise in forming the sum $$x_1 + x_2 + \ldots + x_n$$.

Going back to the single-box version of the CLT, the case of a symmetric distribution is simpler to handle: its median equals its mean, so there's a 50% chance that $$x_i$$ will be less than the box's mean and a 50% chance that $$x_i$$ will be greater than its mean. Moreover, when $$n$$ is sufficiently large, the positive deviations from the mean ought to compensate for the negative deviations from the mean. (This requires some careful justification, not just hand waving.) Thus we ought primarily to be concerned about counting the numbers of positive and negative deviations and only have a secondary concern about their sizes. (Of all the things I have written here, this might be the most useful at providing some intuition about why the CLT works. Indeed, the technical assumptions needed to make the generalizations of the CLT true essentially are various ways of ruling out the possibility that rare huge deviations will upset the balance enough to prevent the limiting histogram from arising.)

This shows, to some degree anyway, why the first generalization of the CLT does not really uncover anything that was not in de Moivre's original Bernoulli trial version.

At this point it looks like there is nothing for it but to do a little math: we need to count the number of distinct ways in which the number of positive deviations from the mean can differ from the number of negative deviations by any predetermined value $$k$$, where evidently $$k$$ is one of $$-n, -n+2, \ldots, n-2, n$$. But because vanishingly small errors will disappear in the limit, we don't have to count precisely; we only need to approximate the counts. To this end it suffices to know that

the number of ways to obtain $$k$$ positive and $$n-k$$ negative values out of $$n$$ equals $$\frac{n-k+1}{k}$$ times the number of ways to get $$k-1$$ positive and $$n-k+1$$ negative values.

(That's a perfectly elementary result so I won't bother to write down the justification.) Now we approximate wholesale. The maximum frequency occurs when $$k$$ is as close to $$n/2$$ as possible (also elementary). Let's write $$m = n/2$$. Then, relative to the maximum frequency, the frequency of $$m+j+1$$ positive deviations ($$j \ge 0$$) is estimated by the product

$$\frac{m+1}{m+1} \frac{m}{m+2} \cdots \frac{m-j+1}{m+j+1}$$

$$=\frac{1 - 1/(m+1)}{1 + 1/(m+1)} \frac{1-2/(m+1)}{1+2/(m+1)} \cdots \frac{1-j/(m+1)}{1+j/(m+1)}.$$
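For readers who would like to check the counting step numerically, here is a small sketch. It assumes that "the number of ways" is the binomial coefficient (computed with `math.comb`) and that $$n = 2m$$ is even; the function names are made up for illustration. It verifies the recurrence and compares the exact relative frequency with the product estimate displayed above.

```python
import math

def ways(n, k):
    """Number of ways to get k positive and n - k negative values out of n."""
    return math.comb(n, k)

def recurrence_holds(n):
    """Check ways(n, k) = (n - k + 1) / k * ways(n, k - 1) using exact integers."""
    return all(k * ways(n, k) == (n - k + 1) * ways(n, k - 1)
               for k in range(1, n + 1))

def exact_relative_frequency(m, j):
    """Frequency of m + j + 1 positive values relative to the maximum, with n = 2m."""
    return ways(2 * m, m + j + 1) / ways(2 * m, m)

def product_estimate(m, j):
    """The product (m+1)/(m+1) * m/(m+2) * ... * (m-j+1)/(m+j+1)."""
    result = 1.0
    for i in range(j + 1):
        result *= (m + 1 - i) / (m + 1 + i)
    return result

if __name__ == "__main__":
    print(recurrence_holds(30))
    m = 5000
    for j in (0, 10, 50, 100):
        print(j, exact_relative_frequency(m, j), product_estimate(m, j))
```

In this sketch the exact value and the estimate differ by a factor that stays close to 1 as long as $$j$$ is small compared with $$m$$, which is the regime the argument cares about.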

135 years before de Moivre was writing, John Napier invented logarithms to simplify multiplication, so let's take advantage of this. Using the approximation

$$\log\left(\frac{1-x}{1+x}\right) = -2x - \frac{2x^3}{3} + O(x^5),$$

we find that the log of the relative frequency is approximately

$$-\frac{2}{m+1}\left(1 + 2 + \cdots + j\right) - \frac{2}{3(m+1)^3}\left(1^3+2^3+\cdots+j^3\right) = -\frac{j^2}{m} + O\left(\frac{j^4}{m^3}\right).$$

Because the error in approximating this sum by $$-j^2/m$$ is on the order of $$j^4/m^3$$, the approximation ought to work well provided $$j^4$$ is small relative to $$m^3$$. That covers a greater range of values of $$j$$ than is needed. (It suffices for the approximation to work for $$j$$ only on the order of $$\sqrt{m}$$ which asymptotically is much smaller than $$m^{3/4}$$.)

Consequently, writing $$z = \sqrt{2}\,\frac{j}{\sqrt{m}} = \frac{j/n}{1 / \sqrt{4n}}$$ for the standardized deviation, the relative frequency of deviations of size given by $$z$$ must be proportional to $$\exp(-z^2/2)$$ for large $$m.$$ Thus appears the Gaussian law of #3 above.
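A quick numerical check of this conclusion, offered as a sketch rather than a proof: for an even number of trials $$n = 2m$$, compare the exact frequency of $$m + j$$ positive deviations, relative to the maximum frequency, with $$\exp(-z^2/2)$$ where $$z = 2j/\sqrt{n}$$. The function names and the choice $$n = 10000$$ are illustrative assumptions.

```python
import math

def relative_frequency(n, j):
    """Exact frequency of n/2 + j positives relative to the most likely count (n even)."""
    m = n // 2
    return math.comb(n, m + j) / math.comb(n, m)

def gaussian_shape(n, j):
    """exp(-z^2 / 2) with the standardized deviation z = 2 j / sqrt(n)."""
    z = 2 * j / math.sqrt(n)
    return math.exp(-z * z / 2)

if __name__ == "__main__":
    n = 10_000
    for j in (0, 25, 50, 100, 150):
        print(j, relative_frequency(n, j), gaussian_shape(n, j))
```

The values of `j` used here are on the order of $$\sqrt{m}$$, which is the range where the approximation is claimed to work.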

Obviously much more analysis of this sort should be presented to justify the other assertions in the CLT, but I'm running out of time, space, and energy and I've probably lost 90% of the people who started reading this anyway. This simple approximation, though, suggests how de Moivre might originally have suspected that there is a universal limiting distribution, that its logarithm is a quadratic function, and that the proper scale factor $$s_n$$ must be proportional to $$\sqrt{n}$$ (as shown by the denominator of the preceding formula). It is difficult to imagine how this important quantitative relationship could be explained without invoking some kind of mathematical information and reasoning; anything less would leave the precise shape of the limiting curve a complete mystery.