[Math] Convergence rate of the mean and variance of a normal variable – Numerical simulation

MATLABnormal distributionprobability

I am simulating in Matlab a Normal Random Variable $X$ with mean $E(X)=0$ and variance $\text{var}(X)=1$, using the built-in randnfunction.

However I am noticing that in order for the mean $(\frac{1}{N} \sum_{n=1}^N X_n)$ and variance $(\frac{1}{N} \sum_{n=1}^N X_n^2 – (\frac{1}{N} \sum_{n=1}^N X_n)^2)$ of the set of realizations $(X_n)_{n\in[1..N]}$ of this random variable to reach $0$ and $1$, respectively, with an error lower than $0.01\%$, I need $N$ to be at least higher than $10^4$.

Is there any way to reach a mean and variance of the set of realizations close to the theoretical values with less than $0.01\%$ error but with much less realizations $N$ (on the order of 10 typically). In order words is there any other way to program a probability distribution in Matlab that converges faster than randn to a gaussian distribution with given mean and variance?

Best Answer

There is no way to produce a small random sample from a Normal distribution that is guaranteed to have a sample mean and sample variance within $0.01\%$ of the population mean and population variance.


The key insight here is that even if you have a method of producing perfect random samples from a standard Normal distribution, the samples will not have the same mean and variance as the underlying distribution of the population. In other words, if you are really sampling from a Normal distribution you expect variability in your sample mean and sample variance, for a given sample size. Finding a way to get rid of that variability would force you to either produce something that is not a random sample or did not come from a Normal distribution.

The error in Matlab's randn comes from (1) machine precision limitations inherent to any finite computation and (2) the fact that it uses a pseudo random number generator. Both of these limitations are inherent to any computational implementation that produces samples from a probability distribution in a repeatable way. Matlab R2016A and other modern versions default to the Mersenne Twister method, which produces high quality pseudo random numbers. The error from the generator and the machine precision is much less than $0.01\%$.

Almost all of the difference between your sample mean and the population mean comes from the randomness in the Normal distribution itself, so it's best to think of that difference as variability rather than error.

If you need a small set of numbers that has sample mean and sample variance close to the population values, you could for example take $\{-1.32,-0.54,0.56,1.3\}$. This set of numbers has $(\frac{1}{N} \sum_{n=1}^N X_n) = 0$ and $(\frac{1}{N} \sum_{n=1}^N X_n^2 - (\frac{1}{N} \sum_{n=1}^N X_n)^2) = 1.0094$ $^*$. However, I explicitly chose these numbers - they were not randomly sampled from a Normal distribution.

The main conclusion is that the variability in the sample is a good thing because it represents the true variability in samples from a Normal distribution. If you're using the normal random numbers as an input to further simulation, consider variance reduction techniques to address this problem while maintaining rigor in your random sampling.


$^*$ I'm using your formula for variance but you should carefully consider which sample variance you want to use.