What’s wrong with the simulation of the Central Limit Theorem

central limit theoremprobability

While there is code in this question, I suspect the answer will be mathematical.

I am trying to create a numerical simulation of the Central Limit Theorem (CLT). My understanding is that if $S_n = \frac{X_1 + X_2 + \dots + X_n}{n}$ for $n$ i.i.d. random variables $\{X_1, X_2, \dots X_n\}$, then the CLT states that

$$
\lim_{n \rightarrow \infty} \sqrt{n}(S_n – \mu) \stackrel{\text{dist}}{\rightarrow} \mathcal{N}(0, \sigma^2)
$$

where $X_i$ has expectation $\mu$ and variance $\sigma^2$. So one way to empirically explore this is to plot both the distribution $\mathcal{N}(0, \sigma^2)$ and the distribution from $n$ random samples of the random variable $\sqrt{n}(S_n – \mu)$. When I do that, I get this figure:

enter image description here

My code is below. In my simulation, $X_i \sim \text{Uniform}(0, 1)$, meaning that $\mu = \frac{1}{2}(0 – 1)$ and $\sigma^2 = \frac{1}{12}(1 – 0)$. I first generate 2000 replicates of $\sqrt{n} (S_n – \mu)$ and plot that distribution. I then plot the theoretical distribution that I expect this distribution to converge to, namely $\mathcal{N}(0, \sigma^2)$. Unfortunately, the two distributions look way off.

I suspect I have a "mathematical bug" rather than an programming one. In summary: What is wrong with my thinking about the CLT that makes this simulation wrong?


Simulation code

import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from   scipy.stats import norm as normal

a = 0
b = 1
n = 100
reps = 2000
mu_lim  = 1/2.  * (a + b)
var_lim = 1/12. * (b - a)**2

rvs = []
for _ in range(reps):
    Sn = np.random.uniform(a, b, n).sum() / n
    rv = np.sqrt(n) * (Sn - mu_lim)
    rvs.append(rv)

_, bins, _ = plt.hist(rvs, bins=50, density=True)
(mu_actual, var_actual) = normal.fit(rvs)
y = normal.pdf(bins, mu_actual, var_actual)
plt.plot(bins, y, 'r--', label='Actual N(., .) for n = %s' % n)

# Plot distribution to converge to
x = np.linspace(-3*var_lim, 3*var_lim, 100)
plt.plot(x, normal.pdf(x, 0, var_lim), label='N(0, s^2)')

plt.legend()
plt.show()

Best Answer

The norm.pdf function names the parameters “loc” and “scale”, and “scale” strongly suggests it should take the standard deviation, whereas you have passed in the variance. Per Did’s comment, it makes sense that you see a peak $\sqrt{12}$ times too high, around 4.78 rather than 1.38.