Normal Distribution – Error in Normal Approximation to Uniform Sum Distribution

Tags: approximation, central limit theorem, moments, normal distribution

One naive method for approximating a normal distribution is to add together perhaps $100$ IID random variables uniformly distributed on $[0,1]$, then recenter and rescale, relying on the Central Limit Theorem. (Side note: There are more accurate methods such as the Box–Muller transform.) The sum of IID $U(0,1)$ random variables is known as the uniform sum distribution or Irwin–Hall distribution.
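For concreteness, here is a minimal sketch of that naive generator (the function name `approx_normal` and its defaults are illustrative, not from the post): sum $n$ uniforms, then recenter and rescale to mean $0$ and variance $1$.

```python
import numpy as np

def approx_normal(size, n=100, seed=0):
    """Approximate N(0,1) samples via a standardized sum of n iid U(0,1) draws."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, size=(size, n))
    # The raw sum has mean n/2 and variance n/12; recenter and rescale.
    return (u.sum(axis=1) - n / 2.0) / np.sqrt(n / 12.0)

samples = approx_normal(100_000)
print(samples.mean(), samples.std())  # should be close to 0 and 1
```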

How large is the error in approximating a uniform sum distribution by a normal distribution?

Whenever this type of question comes up for approximating the sum of IID random variables, people (including me) bring up the Berry–Esseen Theorem, which is an effective version of the Central Limit Theorem given that the third moment exists:

$$|F_n(x) - \Phi(x)| \le \frac{C \rho}{\sigma^3 \sqrt n} $$

where $F_n$ is the cumulative distribution function of the recentered and rescaled sum of $n$ IID random variables, $\rho$ is the absolute third central moment $E|X-EX|^3$, $\sigma$ is the standard deviation, and $C$ is an absolute constant, which can be taken to be $1$ or even $1/2$.

This is unsatisfactory. It seems to me that the Berry–Esseen estimate is closest to sharp for binomial distributions, which are discrete, with the largest error occurring at $0$ for a symmetric binomial distribution: the largest error comes at the largest jump of the CDF. However, the uniform sum distribution has no jumps.

Numerical tests suggest that the error shrinks more rapidly than $c/\sqrt n$.

Using $C=1/2$, $\sigma = 1/\sqrt{12}$, and $\rho = E\left|X-\tfrac12\right|^3 = 2\int_0^{1/2} t^3\,dt = \tfrac{1}{32}$ for $X \sim U(0,1)$, the Berry–Esseen estimate is $$|F_n(x) - \Phi(x)| \le \frac{\frac12 \cdot \frac{1}{32}}{12^{-3/2} \sqrt n} \approx \frac{0.650}{\sqrt n}$$

which for $n=10,20,40$ is about $0.205$, $0.145$, and $0.103$, respectively. The actual maximum differences for $n=10, 20, 40$ appear to be about $0.00281$, $0.00139$, and $0.000692$, respectively, which are much smaller and appear to fall as $c/n$ instead of $c/\sqrt n$.
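These maxima can be checked with a short script. The sketch below (not from the original post) assumes that evaluating the exact Irwin–Hall CDF on a fine grid and taking the largest absolute difference from $\Phi$ is an adequate substitute for the true supremum; the helper names `irwin_hall_cdf` and `max_abs_error` are made up for illustration.

```python
import numpy as np
from math import comb, factorial
from scipy.stats import norm

def irwin_hall_cdf(x, n):
    """Exact CDF of the sum of n iid U(0,1) variables (alternating-sum formula)."""
    x = np.clip(np.asarray(x, dtype=float), 0.0, n)
    total = np.zeros_like(x)
    for k in range(n + 1):
        total += (-1) ** k * comb(n, k) * np.clip(x - k, 0.0, None) ** n
    return total / factorial(n)

def max_abs_error(n, grid=100_001):
    """Grid approximation of sup_x |F_n(x) - Phi(x)| for the standardized sum.

    Both distributions are symmetric about the mean, so it suffices to search
    z <= 0; this also avoids the worst floating-point cancellation in the
    alternating sum, which occurs in the upper tail for larger n.
    """
    z = np.linspace(-6.0, 0.0, grid)         # standardized coordinates
    s = z * np.sqrt(n / 12.0) + n / 2.0      # corresponding raw-sum values
    return np.max(np.abs(irwin_hall_cdf(s, n) - norm.cdf(z)))

for n in (10, 20, 40):
    print(n, max_abs_error(n), 0.650 / np.sqrt(n))  # observed error vs. Berry-Esseen bound
```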

Best Answer

Let $U_1, U_2,\dots$ be iid $\mathcal U(-b,b)$ random variables and consider the normalized sum $$ S_n = \frac{\sqrt{3} \sum_{i=1}^n U_i}{b \sqrt{n}} \>, $$ and the associated $\sup$ norm $$ \delta_n = \sup_{x\in\mathbb R} |F_n(x) - \Phi(x)| \>, $$ where $F_n$ is the distribution function of $S_n$.
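As a quick check that this matches the setup in the question (a step not spelled out in the original answer): a $\mathcal U(-b,b)$ variable has mean $0$ and variance $b^2/3$, so
$$ S_n = \frac{\sum_{i=1}^n U_i}{\sqrt{n b^2/3}} = \frac{\sqrt{3} \sum_{i=1}^n U_i}{b \sqrt{n}} $$
has mean $0$ and variance $1$; the scale $b$ drops out, and $\delta_n$ is exactly the maximum difference estimated numerically in the question for $U(0,1)$ summands.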

Lemma 1 (Uspensky): The following bound on $\delta_n$ holds. $$ \delta_n < \frac{1}{7.5 \pi n} + \frac{1}{\pi}\left(\frac{2}{\pi}\right)^n + \frac{12}{\pi^3 n} \exp(-\pi^2 n / 24) \>. $$

Proof. See J. V. Uspensky (1937), Introduction to mathematical probability, New York: McGraw-Hill, p. 305.

This was later improved by R. Sherman to the following.

Lemma 2 (Sherman): The following improvement on the Uspensky bound holds. $$\delta_n < \frac{1}{7.5 \pi n} - \left(\frac{\pi}{180}+\frac{1}{7.5\pi n}\right) e^{-\pi^2 n / 24} + \frac{1}{(n+1)\pi}\left(\frac{2}{\pi}\right)^n + \frac{12}{\pi^3 n} e^{-\pi^2 n / 24} \>.$$

Proof. See R. Sherman, Error of the normal approximation to the sum of N random variables, Biometrika, vol. 58, no. 2, pp. 396–398.

The proof is a pretty straightforward application of the triangle inequality and classical bounds on the tail of the normal distribution and on $(\sin x) / x$ applied to the characteristic functions of each of the two distributions.
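For a rough sense of how the two bounds compare, they can be evaluated directly. The sketch below simply transcribes the formulas from Lemmas 1 and 2 as stated above; the values it prints can be set against the observed maxima of about $0.00281$, $0.00139$, and $0.000692$ for $n=10,20,40$.

```python
import numpy as np

def uspensky_bound(n):
    """Lemma 1 (Uspensky) bound on delta_n, transcribed as stated above."""
    return (1.0 / (7.5 * np.pi * n)
            + (1.0 / np.pi) * (2.0 / np.pi) ** n
            + 12.0 / (np.pi ** 3 * n) * np.exp(-np.pi ** 2 * n / 24.0))

def sherman_bound(n):
    """Lemma 2 (Sherman) bound on delta_n, transcribed as stated above."""
    return (1.0 / (7.5 * np.pi * n)
            - (np.pi / 180.0 + 1.0 / (7.5 * np.pi * n)) * np.exp(-np.pi ** 2 * n / 24.0)
            + 1.0 / ((n + 1) * np.pi) * (2.0 / np.pi) ** n
            + 12.0 / (np.pi ** 3 * n) * np.exp(-np.pi ** 2 * n / 24.0))

for n in (10, 20, 40):
    print(n, uspensky_bound(n), sherman_bound(n))
```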
