Solved – Generating Data from Arbitrary Distribution

data transformation, distributions, quantiles, uniform distribution

If we want to generate a random sample according to a distribution $F$, we can generate a uniform random number on $(0,1)$ and apply the inverse $F^{-1}$ to it. This works because, if $U$ is uniform on $(0,1)$, then $X=F^{-1}(U)$ is a random variable that follows $F$.

I know this is true as $P(X\le x) = P(F^{-1}(U)\le x) = P(U\le F(x)) = F(x)$.
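As a numerical sanity check on this identity, here is a small Python sketch (my own illustration, not from the question) using the rate-1 exponential distribution, whose CDF $F(x)=1-e^{-x}$ inverts to $F^{-1}(u)=-\ln(1-u)$: applying $F^{-1}$ to uniform draws should give an empirical CDF close to $F$.

```python
import math
import random

random.seed(0)

def F(x):
    """CDF of the exponential distribution with rate 1."""
    return 1.0 - math.exp(-x)

def F_inv(u):
    """Inverse CDF (quantile function) of the same distribution."""
    return -math.log(1.0 - u)

# Generate samples via X = F^{-1}(U), with U uniform on (0, 1).
samples = [F_inv(random.random()) for _ in range(100_000)]

# Compare the empirical CDF P(X <= x) with F(x) at a few points.
for x in (0.5, 1.0, 2.0):
    empirical = sum(s <= x for s in samples) / len(samples)
    print(f"x={x}: empirical={empirical:.3f}, F(x)={F(x):.3f}")
```

With 100,000 samples the empirical and theoretical values should agree to within a few thousandths, exactly as the algebra $P(F^{-1}(U)\le x)=F(x)$ predicts.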

Does anyone know the intuition behind this? What would lead someone to assert this? Any help is greatly appreciated.

Best Answer

This is known as inverse transform sampling. The idea is well encapsulated in the following picture from Wikipedia:

[Figure: inverse transform sampling, from Wikipedia]

Note that the image of the cumulative distribution function (CDF) $F_X$ is the interval $[0,1]$ on the $y$ axis. (Purists will debate whether the endpoints should be included.) Also note that the CDF is, of course, monotone.

In inverse transform sampling, we sample uniformly from this image, i.e., from $U[0,1]$. These samples are the dots on the $y$ axis. From each dot we go right to the graph of $F_X$, then down to the $x$ axis. This is where the "inverse" comes in: we start on the $y$ axis and end up on the $x$ axis.

The result on the $x$ axis is distributed according to $F_X$.

  • Where $F_X$ is steep (i.e., the density $f_X$ is large), $y$ values that are close together yield $x$ values that are close together. We get a high density of $x$ values.
  • Where $F_X$ is flat (i.e., $f_X$ is small), $y$ values that are close together yield $x$ values that are farther apart. We get a low density of $x$ values.
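The two bullets above can be checked numerically. A sketch (again my own illustration, reusing the rate-1 exponential, whose CDF is steep near $0$ and flat in the tail): an interval of fixed width near $0$ should capture far more samples than an equally wide interval out in the tail.

```python
import math
import random

random.seed(0)

# Quantile function (inverse CDF) of the exponential distribution with rate 1.
def quantile(u):
    return -math.log(1.0 - u)

samples = [quantile(random.random()) for _ in range(100_000)]

def count_in(lo, hi):
    """Number of samples landing in the half-open interval [lo, hi)."""
    return sum(lo <= s < hi for s in samples)

# Both intervals have width 0.5; the CDF is steep on [0, 0.5] and flat on [3, 3.5].
steep = count_in(0.0, 0.5)  # expected fraction: F(0.5) - F(0) ~ 0.393
flat = count_in(3.0, 3.5)   # expected fraction: F(3.5) - F(3) ~ 0.020
print(steep, flat)
```

The count in the steep region should exceed the count in the flat region by roughly a factor of twenty, matching the intuition that steep stretches of $F_X$ collect many closely spaced $x$ values.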