Normal Distribution Sampling – How to Sample with Known Mean and Variance

computational-statisticsnormal distributionsampling

I've never had a course in statistics, so I hope I'm asking in the right place here.

Suppose I have only two data describing a normal distribution: the mean $\mu$ and variance $\sigma^2$. I want to use a computer to randomly sample from this distribution such that I respect these two statistics.

It's pretty obvious that I can handle the mean by simply normalizing around 0: just add $\mu$ to each sample before outputting the sample. But I don't see how to programmatically generate samples to respect $\sigma^2$.

My program will be in a conventional programming language; I don't have access to any statistical packages.

Best Answer

If you can sample from a given distribution with mean 0 and variance 1, then you can easily sample from a scale-location transformation of that distribution, which has mean $\mu$ and variance $\sigma^2$. If $x$ is a sample from a mean 0 and variance 1 distribution then $$\sigma x + \mu$$ is a sample with mean $\mu$ and variance $\sigma^2$. So, all you have to do is to scale the variable by the standard deviation $\sigma$ (square root of the variance) before adding the mean $\mu$.

How you actually get a simulation from a normal distribution with mean 0 and variance 1 is a different story. It's fun and interesting to know how to implement such things, but whether you use a statistical package or programming language or not, I will recommend that you obtain and use a suitable function or library for the random number generation. If you want advice on what library to use you might want to add specific information on which programming language(s) you are using.

Edit: In the light of the comments, some other answers and the fact that Fixee accepted this answer, I will give some more details on how one can use transformations of uniform variables to produce normal variables.

  • One method, already mentioned in a comment by VitalStatistix, is the Box-Muller method that takes two independent uniform random variables and produces two independent normal random variables. A similar method that avoids the computation of two transcendental functions sin and cos at the expense of a few more simulations was posted as an answer by francogrex.
  • A completely general method is the transformation of a uniform random variable by the inverse distribution function. If $U$ is uniformly distributed on $[0,1]$ then $$\Phi^{-1}(U)$$ has a standard normal distribution. Though there is no explicit analytic formula for $\Phi^{-1}$, it can be computed by accurate numerical approximations. The current implementation in R (last I checked) uses this idea. The method is conceptually very simple, but requires an accurate implementation of $\Phi^{-1}$, which is probably not as widespread as the (other) transcendental functions log, sin and cos.
  • Several answers mention the possibility of using the central limit theorem to approximate the normal distribution as an average of uniform random variables. This is not generally recommended. Arguments presented, such as matching the mean 0 and variance 1, and considerations of support of the distribution are not convincing. In Exercise 2.3 in "Introducing Monte Carlo Methods with R" by Christian P. Robert and George Casella this generator is called antiquated and the approximation is called very poor.
  • There is a bewildering number of other ideas. Chapter 3 and, in particular, Section 3.4, in "The Art of Computer Programming" Vol. 2 by Donald E. Knuth is a classical reference on random number generation. Brian Ripley wrote Computer Generation of Random Variables: A Tutorial, which may be useful. The book mentioned by Robert and Casella, or perhaps Chapter 2 in their other book, "Monte Carlo statistical methods", is also recommended.

At the end of the day, a correctly implemented method is not better than the uniform pseudo random number generator used. Personally, I prefer to rely on special purpose libraries that I believe are trustworthy. I almost always rely on the methods implemented in R either directly in R or via the API in C/C++. Obviously, this is not a solution for everybody, but I am not familiar enough with other libraries to recommend alternatives.