This figure shows the distributions of the means of $n=1$ (blue), $10$ (red), and $100$ (gold) independent and identically distributed (*iid*) normal distributions (of unit variance and mean $\mu$):

As $n$ increases, the distribution of the mean becomes more "focused" on $\mu$. (The sense of "focusing" is easily quantified: given any fixed open interval $(a,b)$ surrounding $\mu$, the amount of the distribution within $[a,b]$ increases with $n$ and has a limiting value of $1$.)

However, *when we standardize* these distributions, we rescale each of them to have a mean of $0$ and a unit variance: they are all the same then. This is how we see that although the PDFs of the means themselves are spiking upwards and focusing around $\mu$, nevertheless every one of these distributions is still has a Normal *shape,* even though they differ individually.

The Central Limit Theorem says that when you start with *any* distribution--not just a normal distribution--that has a finite variance, and play the same game with means of $n$ iid values as $n$ increases, you see the same thing: the mean distributions focus around the original mean (the Weak Law of Large Numbers), but the *standardized* mean distributions converge to a standard Normal distribution (the Central Limit Theorem).

This may be better appreciated by expressing the result of CLT in terms of sums of iid random variables. We have

$$\sqrt{n} \frac{ \bar{X} -\mu}{\sigma} \sim N(0, 1) \quad \text{asymptotically}$$

Multiply the quotient by $\frac{\sigma}{\sqrt{n}}$ and use the fact that $Var(cX) = c^2 Var(X)$ to get

$$\bar{X}-\mu \sim N\left(0, \frac{\sigma^2}{n} \right)$$

Now add $\mu$ to the LHS and use the fact that $\mathbb{E} \left[a X+\mu\right] = a \mathbb{E}[X] + \mu$ to obtain

$$\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i \sim N\left(\mu, \frac{\sigma^2}{n} \right)$$

Lastly, multiply by $n$ and use the above two results to see that

$$\sum_{i=1}^n X_i \sim N \left(n \mu, n\sigma^2 \right) $$

And what does this have to do with Wooldridge's statement? Well, if the error is the **sum of many iid random variables** then it will be approximately normally distributed, as just seen. But there is an issue here, namely that the unobserved factors will not necessarily be identically distributed and they might not even be independent!

Nevertheless, the CLT has been successfully extended to independent non-identically distributed random variables and even cases of mild dependence, under some additional regularity conditions. These are essentially conditions that guarantee that no term in the sum exerts disproportional influence on the asymptotic distribution, see also the wikipedia page on the CLT. You do not need to know these results of course; Wooldridge's aim is merely to provide intuition.

Hope this helps.

## Best Answer

I apologize in advance for the length of this post: it is with some trepidation that I let it out in public at all, because it takes some time and attention to read through and undoubtedly has typographic errors and expository lapses. But here it is for those who are interested in the fascinating topic, offered in the hope that it will encourage you to identify one or more of the many parts of the CLT for further elaboration in responses of your own.

Most attempts at "explaining" the CLT are illustrations or just restatements that assert it is true. A really penetrating, correct explanation would have to explain an awful lot of things.

Before looking at this further, let's be clear about what the CLT says.As you all know, there are versions that vary in their generality. The common context is a sequence of random variables, which are certain kinds of functions on a common probability space. For intuitive explanations that hold up rigorously I find it helpful to think of a probability space as a box with distinguishable objects. It doesn't matter what those objects are but I will call them "tickets." We make one "observation" of a box by thoroughly mixing up the tickets and drawing one out; that ticket constitutes the observation. After recording it for later analysis we return the ticket to the box so that its contents remain unchanged. A "random variable" basically is a number written on each ticket.In 1733, Abraham de Moivre considered the case of a single box where the numbers on the tickets are only zeros and ones ("Bernoulli trials"), with some of each number present. He imagined making $n$ physically

independentobservations, yielding a sequence of values $x_1, x_2, \ldots, x_n$, all of which are zero or one. Thesumof those values, $y_n = x_1 + x_2 + \ldots + x_n$, is random because the terms in the sum are. Therefore, if we could repeat this procedure many times, various sums (whole numbers ranging from $0$ through $n$) would appear with various frequencies--proportions of the total. (See the histograms below.)Now one would expect--and it's true--that for very large values of $n$, all the frequencies would be quite small. If we were to be so bold (or foolish) as to attempt to "take a limit" or "let $n$ go to $\infty$", we would conclude correctly that all frequencies reduce to $0$. But

if we simply draw a histogramof the frequencies, without paying any attention to how its axes are labeled, we see that the histograms for large $n$ all begin to look the same: in some sense, these histogramsapproach a limiteven though the frequencies themselves all go to zero.These histograms depict the results of repeating the procedure of obtaining $y_n$ many times. $n$ is the "number of trials" in the titles.The insight here is to

draw the histogram first and label its axes later. With large $n$ the histogram covers a large range of values centered around $n/2$ (on the horizontal axis) and a vanishingly small interval of values (on the vertical axis), because the individual frequencies grow quite small. Fitting this curve into the plotting region has therefore required both ashiftingandrescalingof the histogram. The mathematical description of this is that for each $n$ we can choose some central value $m_n$ (not necessarily unique!) to position the histogram and some scale value $s_n$ (not necessarily unique!) to make it fit within the axes. This can be done mathematically by changing $y_n$ to $z_n = (y_n - m_n) / s_n$.Remember that a histogram represents frequencies by

areasbetween it and the horizontal axis.The eventual stability of these histograms for large values of $n$ should therefore be stated in terms of area.So, pick any interval of values you like, say from $a$ to $b \gt a$ and, as $n$ increases, track the area of the part of the histogram of $z_n$ that horizontally spans the interval $(a, b]$. The CLT asserts several things:No matter what $a$ and $b$ are,if we choose the sequences $m_n$ and $s_n$ appropriately (in a way that does not depend on $a$ or $b$ at all), this area indeed approaches a limit as $n$ gets large.The sequences $m_n$ and $s_n$ can be chosen in a way that depends only on $n$, the average of values in the box, and some measure of spread of those values--but on nothing else--so that regardless of what is in the box, the limit is always the same. (This universality property is amazing.)

Specifically, that limiting area is the area under the curve $y = \exp(-z^2/2) / \sqrt{2 \pi}$ between $a$ and $b$: this is the formula of that universal limiting histogram.

The first generalization of the CLT adds,

exactly the same conclusions hold(provided that the proportions of extremely large or small numbers in the box are not "too great," a criterion that has a precise and simple quantitative statement).The next generalization, and perhaps the most amazing one, replaces this single box of tickets with an ordered indefinitely long array of boxes with tickets. Each box can have different numbers on its tickets in different proportions. The observation $x_1$ is made by drawing a ticket from the first box, $x_2$ comes from the second box, and so on.

Exactly the same conclusions holdprovided the contents of the boxes are "not too different" (there are several precise, but different, quantitative characterizations of what "not too different" has to mean; they allow an astonishing amount of latitude).These five assertions, at a minimum, need explaining.There's more. Several intriguing aspects of the setup are implicit in all the statements. For example,What is special about the sum? Why don't we have central limit theorems for other mathematical combinations of numbers such as their product or their maximum? (It turns out we do, but they are not quite so general nor do they always have such a clean, simple conclusion unless they can be reduced to the CLT.) The sequences of $m_n$ and $s_n$ are not unique but they'realmostunique in the sense that eventually they have to approximate the expectation of the sum of $n$ tickets and thestandard deviationof the sum, respectively (which, in the first two statements of the CLT, equals $\sqrt{n}$ times the standard deviation of the box).The standard deviation is one measure of the spread of values, but it is by no means the only one nor is it the most "natural," either historically or for many applications. (Many people would choose something like a median absolute deviation from the median, for instance.)

Why does the SD appear in such an essential way?Consider the formula for the limiting histogram:

who would have expected it to take such a form?It says thelogarithmof the probability density is aquadraticfunction. Why? Is there some intuitive or clear, compelling explanation for this?I confess I am unable to reach the ultimate goal of supplying answers that are simple enough to meet Srikant's challenging criteria for intuitiveness and simplicity, but I have sketched this background in the hope that others might be inspired to fill in some of the many gaps. I think a good demonstration will ultimately have to rely on an elementary analysis of how values between $\alpha_n = a s_n + m_n$ and $\beta_n = b s_n + m_n$ can arise in forming the sum $x_1 + x_2 + \ldots + x_n$. Going back to the single-box version of the CLT, the case of a symmetric distribution is simpler to handle: its median equals its mean, so there's a 50% chance that $x_i$ will be less than the box's mean and a 50% chance that $x_i$ will be greater than its mean. Moreover, when $n$ is sufficiently large, the positive deviations from the mean ought to compensate for the negative deviations in the mean. (This requires some careful justification, not just hand waving.) Thus

we ought primarily to be concerned about(Of all the things I have written here, this might be the most useful at providing some intuition about why the CLT works. Indeed, the technical assumptions needed to make the generalizations of the CLT true essentially are various ways of ruling out the possibility that rare huge deviations will upset the balance enough to prevent the limiting histogram from arising.)countingthe numbers of positive and negative deviations and only have a secondary concern about theirsizes.This shows, to some degree anyway, why the first generalization of the CLT does not really uncover anything that was not in de Moivre's original Bernoulli trial version.

At this point it looks like there is nothing for it but to do a little math:

we need to countthe number of distinct ways in which the number of positive deviations from the mean can differ from the number of negative deviations by any predetermined value $k$, where evidently $k$ is one of $-n, -n+2, \ldots, n-2, n$. But because vanishingly small errors will disappear in the limit, we don't have to count precisely; we only need to approximate the counts. To this end it suffices to know that$$\text{The number of ways to obtain } k \text{ positive and } n-k \text{ negative values out of } n$$

$$\text{equals } \frac{n-k+1}{k}$$

$$\text{times the number of ways to get } k-1 \text{ positive and } n-k+1 \text { negative values.}$$

(That's a perfectly elementary result so I won't bother to write down the justification.) Now we approximate wholesale. The maximum frequency occurs when $k$ is as close to $n/2$ as possible (also elementary). Let's write $m = n/2$. Then,

relative to the maximum frequency,the frequency of $m+j+1$ positive deviations ($j \ge 0$) is estimated by the product$$\frac{m+1}{m+1} \frac{m}{m+2} \cdots \frac{m-j+1}{m+j+1}$$

$$=\frac{1 - 1/(m+1)}{1 + 1/(m+1)} \frac{1-2/(m+1)}{1+2/(m+1)} \cdots \frac{1-j/(m+1)}{1+j/(m+1)}.$$

135 years before de Moivre was writing, John Napier invented logarithms to simplify multiplication, so let's take advantage of this. Using the approximation

$$\log\left(\frac{1-x}{1+x}\right) = -2x - \frac{2x^3}{3} + O(x^5),$$

we find that the log of the relative frequency is approximately

$$-\frac{2}{m+1}\left(1 + 2 + \cdots + j\right) - \frac{2}{3(m+1)^3}\left(1^3+2^3+\cdots+j^3\right) = -\frac{j^2}{m} + O\left(\frac{j^4}{m^3}\right).$$

Because the error in approximating this sum by $-j^2/m$ is on the order of $j^4/m^3$, the approximation ought to work well provided $j^4$ is small relative to $m^3$. That covers a greater range of values of $j$ than is needed. (It suffices for the approximation to work for $j$ only on the order of $\sqrt{m}$ which asymptotically is much smaller than $m^{3/4}$.)

Consequently, writing $$z = \sqrt{2}\,\frac{j}{\sqrt{m}} = \frac{j/n}{1 / \sqrt{4n}}$$ for the

standardized deviation,the relative frequency of deviations of size given by $z$ must be proportional to $\exp(-z^2/2)$ for large $m.$ Thus appears the Gaussian law of #3 above.Obviously much more analysis of this sort should be presented to justify the other assertions in the CLT, but I'm running out of time, space, and energy and I've probably lost 90% of the people who started reading this anyway. This simple approximation, though, suggests how

de Moivre might originally have suspected that there is a universal limiting distribution, that its logarithm is a quadratic function, and that the proper scale factor $s_n$ must be proportional to $\sqrt{n}$(as shown by the denominator of the preceding formula). It is difficult to imagine how this important quantitative relationship could be explained without invoking some kind of mathematical information and reasoning; anything less would leave the precise shape of the limiting curve a complete mystery.