R – How to Draw Random Samples from a Non-Parametric Estimated Distribution

kernel-smoothingrsampling

I have a sample of 100 points which are continuous and one-dimensional. I estimated its non-parametric density using kernel methods. How can I draw random samples from this estimated distribution?

Best Answer

A kernel density estimate is a mixture distribution; for every observation, there's a kernel. If the kernel is a scaled density, this leads to a simple algorithm for sampling from the kernel density estimate:

repeat nsim times:
  sample (with replacement) a random observation from the data
  sample from the kernel, and add the previously sampled random observation

If (for example) you used a Gaussian kernel, your density estimate is a mixture of 100 normals, each centred at one of your sample points and all having standard deviation $h$ equal to the estimated bandwidth. To draw a sample you can just sample with replacement one of your sample points (say $x_i$) and then sample from a $N(\mu = x_i, \sigma = h)$. In R:

# Original distribution is exp(rate = 5)
N = 1000
x <- rexp(N, rate = 5)

hist(x, prob = TRUE)
lines(density(x))

# Store the bandwith of the estimated KDE
bw <- density(x)$bw

# Draw from the sample and then from the kernel
means <- sample(x, N, replace = TRUE)
hist(rnorm(N, mean = means, sd = bw), prob = TRUE)

Strictly speaking, given that the mixture's components are equally weighted, you could avoid the sampling with replacement part and simply draw a sample a size $M$ from each components of the mixture:

M = 10
hist(rnorm(N * M, mean = x, sd = bw))

If for some reason you can't draw from your kernel (ex. your kernel is not a density), you can try with importance sampling or MCMC. For example, using importance sampling:

# Draw from proposal distribution which is normal(mu, sd = 1)
sam <- rnorm(N, mean(x), 1)

# Weight the sample using ratio of target and proposal densities
w <- sapply(sam, function(input) sum(dnorm(input, mean = x, sd = bw)) / 
                                 dnorm(input, mean(x), 1))

# Resample according to the weights to obtain an un-weighted sample
finalSample <- sample(sam, N, replace = TRUE, prob = w)

hist(finalSample, prob = TRUE)

P.S. With my thanks to Glen_b who contributed to the answer.

Related Solutions

Solved – How to draw random sample from an empirical joint distribution

If you want to sample (Age,Income) pairs, use any old sampling method directly on your joint distribution. Using numpy:

dist = [(("15-20 yr", "$1-10000"), 0.12), (("15-20 yr", "$10000-20000"), 0.08), ...]
num_of_samples = 10 # choose any size you want here
samples_i = numpy.random.choice(len(dist), size=num_of_samples, p=[x[1] for x in dist])
samples = [dist[x][0] for x in samples_i]

Kernel Density Estimate – How to Randomly Draw a Value

A kernel density estimator (KDE) produces a distribution that is a location mixture of the kernel distribution, so to draw a value from the kernel density estimate all you need do is (1) draw a value from the kernel density and then (2) independently select one of the data points at random and add its value to the result of (1).

Here is the result of this procedure applied to a dataset like the one in the question.

The histogram at the left depicts the sample. For reference, the black curve plots the density from which the sample was drawn. The red curve plots the KDE of the sample (using a narrow bandwidth). (It is not a problem, or even unexpected, that the red peaks are shorter than the black peaks: the KDE spreads things out, so peaks will get lower to compensate.)

The histogram at the right depicts a sample (of the same size) from the KDE. The black and red curves are the same as before.

Evidently, the procedure used to sample from the density works. It's also extremely fast: the R implementation below generates millions of values per second from any KDE. I have commented it heavily to assist in the porting to Python or other languages. The sampling algorithm itself is implemented in the function rdens with the lines

rkernel <- function(n) rnorm(n, sd=width) 
sample(x, n, replace=TRUE) + rkernel(n)

rkernel draws n iid samples from the kernel function while sample draws n samples with replacement from the data x. The "+" operator adds the two arrays of samples component by component.

For those who would like a formal demonstration of correctness, I offer it here. Let $K$ represent the kernel distribution with CDF $F_K$ and let the data be $\mathbf{x}=(x_1, x_2, \ldots, x_n)$. By the definition of a kernel estimate, the CDF of the KDE is

$$F_{\mathbf{\hat{x}};\, K}(x) = \frac{1}{n}\sum_{i=1}^n F_K(x-x_i).$$

The preceding recipe says to draw $X$ from the empirical distribution of the data (that is, it attains the value $x_i$ with probability $1/n$ for each $i$), independently draw a random variable $Y$ from the kernel distribution, and sum them. I owe you a proof that the distribution function of $X+Y$ is that of the KDE. Let's start with the definition and see where it leads. Let $x$ be any real number. Conditioning on $X$ gives

$$\eqalign{ F_{X+Y}(x) &= \Pr(X+Y \le x) \\ &= \sum_{i=1}^n \Pr(X+Y \le x \mid X=x_i) \Pr(X=x_i) \\ &= \sum_{i=1}^n \Pr(x_i + Y \le x) \frac{1}{n} \\ &= \frac{1}{n}\sum_{i=1}^n \Pr(Y \le x-x_i) \\ &= \frac{1}{n}\sum_{i=1}^n F_K(x-x_i) \\ &= F_{\mathbf{\hat{x}};\, K}(x), }$$

as claimed.

#
# Define a function to sample from the density.
# This one implements only a Gaussian kernel.
#
rdens <- function(n, density=z, data=x, kernel="gaussian") {
  width <- z$bw                              # Kernel width
  rkernel <- function(n) rnorm(n, sd=width)  # Kernel sampler
  sample(x, n, replace=TRUE) + rkernel(n)    # Here's the entire algorithm
}
#
# Create data.
# `dx` is the density function, used later for plotting.
#
n <- 100
set.seed(17)
x <- c(rnorm(n), rnorm(n, 4, 1/4), rnorm(n, 8, 1/4))
dx <- function(x) (dnorm(x) + dnorm(x, 4, 1/4) + dnorm(x, 8, 1/4))/3
#
# Compute a kernel density estimate.
# It returns a kernel width in $bw as well as $x and $y vectors for plotting.
#
z <- density(x, bw=0.15, kernel="gaussian")
#
# Sample from the KDE.
#
system.time(y <- rdens(3*n, z, x)) # Millions per second
#
# Plot the sample.
#
h.density <- hist(y, breaks=60, plot=FALSE)
#
# Plot the KDE for comparison.
#
h.sample <- hist(x, breaks=h.density$breaks, plot=FALSE)
#
# Display the plots side by side.
#
histograms <- list(Sample=h.sample, Density=h.density)
y.max <- max(h.density$density) * 1.25
par(mfrow=c(1,2))
for (s in names(histograms)) {
  h <- histograms[[s]]
  plot(h, freq=FALSE, ylim=c(0, y.max), col="#f0f0f0", border="Gray",
       main=paste("Histogram of", s))
  curve(dx(x), add=TRUE, col="Black", lwd=2, n=501) # Underlying distribution
  lines(z$x, z$y, col="Red", lwd=2)                 # KDE of data

}
par(mfrow=c(1,1))

Best Answer

Related Solutions

Solved – How to draw random sample from an empirical joint distribution

Kernel Density Estimate – How to Randomly Draw a Value

Related Question