[Math] Histogram of random numbers from normal distribution

Tags: normal-distribution, statistics

If I generate, say, 10000 numbers from the normal distribution (in Matlab) and draw a histogram with 10 bins, it resembles the normal distribution pretty accurately. However, if I decide to draw a histogram with 100 or 1000 bins, it doesn't look like a normal distribution anymore: it looks noisier, with peaks and valleys in different places. It would get better if I had a billion numbers.

Why does it work that way? How can this be explained mathematically? My intuition is that as the number of histogram bins gets larger, the probability that a generated number falls into any one small interval (since we have a large number of bins) gets smaller. But that's just intuition, not a mathematical explanation.

Best Answer

Your intuition is exactly correct. If you make the bins small enough then there is effectively a zero probability of seeing a sample in that exact bin. Hence for bins approaching zero width you get either a count of one or zero for your histogram, which is why it looks nothing like your continuous density function.
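Here is a minimal Matlab sketch reproducing the effect from the question, using the same sample size and bin counts (it assumes a Matlab recent enough to have `histogram`, i.e. R2014b+; `hist` behaves similarly on older versions):

```matlab
% Same 10000 draws from N(0,1), binned two different ways.
x = randn(10000, 1);

subplot(1, 2, 1);
histogram(x, 10);      % 10 bins: smooth, bell-shaped
title('10 bins');

subplot(1, 2, 2);
histogram(x, 1000);    % 1000 bins: noisy, many near-empty bins
title('1000 bins');
```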


Mathematically we have the following set-up. Let $X_1,\ldots,X_N$ be i.i.d. random variables from some distribution with cumulative distribution function (CDF) $F(a) = \mathbb{P}(X_i\leq a)$, a map $\mathbb{R}\rightarrow[0,1]$. For some half-open interval $[a,b)$, let $\mathcal{X}=\{X_i : X_i \in [a,b)\}$ and further, let $M = |\mathcal{X}| \leq N$ be the number of draws which happen to fall in our interval $[a,b)$. Note that $M$ is a random variable.

We can see intuitively that $$\frac{M}{N} \xrightarrow{p} \mathbb P(X\in[a,b)) \quad \text{as } N\rightarrow\infty,$$ that is, as we take more samples the fraction falling in our interval converges (in probability) to the probability mass between $a$ and $b$. Further, one can show that even for finite $N$ we have $$\mathbb{E}\left[\frac{M}{N}\right] = \mathbb{P}(X\in [a,b)).$$
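A quick Matlab check of this convergence for the bin $[0,1)$ (the interval choice is arbitrary; `erf` gives the exact normal probability without needing any toolbox):

```matlab
% M/N for the bin [0,1) should approach p = P(X in [0,1)) as N grows.
a = 0; b = 1;
p = 0.5 * (erf(b/sqrt(2)) - erf(a/sqrt(2)));   % exact mass of N(0,1) in [a,b)
for N = [100, 10000, 1000000]
    x = randn(N, 1);
    frac = sum(x >= a & x < b) / N;            % observed M/N
    fprintf('N = %7d   M/N = %.4f   p = %.4f\n', N, frac, p);
end
```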

To show these facts, we need the distribution of the random variable $M$. Because $M$ is discrete, it has a probability mass function (not a density), $\mathbb P(M=m)$.

For any one draw $X_i$, the probability of landing in our interval is $\mathbb P(X\in[a,b)) := p$, and the probability of not landing in it is $(1-p)$; that is, each draw gives an independent Bernoulli trial. Thus for $N$ draws the probability of having $m$ in our interval is $$\mathbb P(M=m) = \binom{N}{m} \,p^m \,(1-p)^{N-m},$$ which should be recognizable as the binomial distribution.

The binomial distribution has mean (expectation) $N\,p$, and thus we get the second equation from above directly: $$\mathbb E \left[\frac{M}{N}\right] = \frac{1}{N} \mathbb E[ M ] = \frac{N\,p}{N} = p.$$ This tells us that the expected value of $M/N$ is indeed the mass associated with our interval.

To show that the limit above holds, it suffices to show that the variance of $M/N$ vanishes as $N\rightarrow \infty$ (convergence in probability then follows from Chebyshev's inequality). This also follows easily from the fact that $M$ is binomially distributed: $$\mathrm{Var}\left(\frac{M}{N}\right) = \frac{1}{N^2}\mathrm{Var}(M) = \frac{Np(1-p)}{N^2} = \frac{p(1-p)}{N},$$ and we're done.
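A small Monte Carlo sketch checking the binomial mean and variance formulas for a fixed bin (the bin $[0,1)$ and the repetition count are arbitrary choices):

```matlab
% Empirical mean and variance of M, the count falling in [a,b), vs. theory.
N = 1000; a = 0; b = 1;
p = 0.5 * (erf(b/sqrt(2)) - erf(a/sqrt(2)));   % P(X in [a,b)) for N(0,1)
reps = 5000;
M = zeros(reps, 1);
for r = 1:reps
    x = randn(N, 1);
    M(r) = sum(x >= a & x < b);
end
fprintf('mean(M) = %6.1f   N*p       = %6.1f\n', mean(M), N*p);
fprintf('var(M)  = %6.1f   N*p*(1-p) = %6.1f\n', var(M), N*p*(1-p));
```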


As an example, let's consider RVs drawn from $\mathcal N(0,\sigma^2)$ and the interval $[0,\infty)$. Clearly in this case we have $p=\mathbb P(X\in[0,\infty)) = 1/2$, and thus the variance of the fraction of samples in our bin, $M/N$, should go as $\frac{1}{4 N}$ with the number of draws. Indeed, this is exactly what we see if we test this hypothesis:
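A sketch of such a test in Matlab (the sample sizes and repetition count are arbitrary choices):

```matlab
% Variance of M/N for the bin [0, Inf) should scale as 1/(4N).
reps = 2000;
for N = [100, 1000, 10000]
    frac = zeros(reps, 1);
    for r = 1:reps
        x = randn(N, 1);
        frac(r) = sum(x >= 0) / N;    % M/N, with p = 1/2
    end
    fprintf('N = %5d   var(M/N) = %.2e   1/(4N) = %.2e\n', ...
            N, var(frac), 1/(4*N));
end
```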

[Figure: empirical variance of $M/N$ versus $N$ for the bin $[0,\infty)$, matching the predicted $1/(4N)$ scaling.]


So, how does this relate to your exact question? We can see that if we increase $N$ while keeping the bins the same, the variance of each normalized bar height $M/N$ will decrease, in keeping with our intuition: more samples means a smoother histogram.

As for the bin widths, it might seem that the variance should increase as we shrink the bins, but in fact, whenever $p<\frac{1}{2}$, decreasing the bin width, and with it $p$, leads to a decrease in the variance of $M$. This is because for sufficiently small $p$ the most likely outcome is $M=0$, and the probability of seeing no (or few) samples in our bin dominates and pulls the variance down; remember that $$\mathrm{Var}(M) = \mathbb{E}[M^2] - \mathbb{E}[M]^2 = \sum_{m}m^2\,\mathbb P(M=m) - (Np)^2,$$ and both terms shrink as the mass of $\mathbb P(M=m)$ concentrates at $m=0$.

Unfortunately this very effect makes it difficult to define an 'optimum' number of bins via the variance alone: we could minimize the variance either by choosing the whole real line as our single bin (in which case $1-p=0$ and the variance is zero) or by letting the number of bins go to infinity so that $\min_{\mathrm{bins}}\{p\}\rightarrow 0$.
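To see the trade-off, one can plot $\mathrm{Var}(M)=Np(1-p)$ directly as a function of the bin mass $p$ (the choice of $N$ here is arbitrary):

```matlab
% Var(M) = N*p*(1-p): zero at p = 0 and p = 1, maximal at p = 1/2.
N = 10000;
p = linspace(0, 1, 201);
plot(p, N .* p .* (1 - p));
xlabel('bin mass p');
ylabel('Var(M)');
title('Variance of the bin count as a function of bin mass');
```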
