Solved – How to draw random sample from an empirical joint distribution

joint distributionsampling

I want to draw a random sample from a known joint distribution. what I have now is the number of people in each age and income bracket (as shown in the example below):0.12 0.42 0.06
16 0.08 0.28 0.04

                   income $1-10,000    income $10,001-20,000
15-20 years old      0.12                      0.08
21-25 years old      0.42                      0.28
26-30 years old      0.06                      0.04

Using this I will get the joint distribution of income and age. But I am not sure where to go from here in terms of drawing samples.

Any thoughts are appreciated. My knowledge in statistics is quite limited…

Best Answer

If you want to sample (Age,Income) pairs, use any old sampling method directly on your joint distribution. Using numpy:

dist = [(("15-20 yr", "$1-10000"), 0.12), (("15-20 yr", "$10000-20000"), 0.08), ...]
num_of_samples = 10 # choose any size you want here
samples_i = numpy.random.choice(len(dist), size=num_of_samples, p=[x[1] for x in dist])
samples = [dist[x][0] for x in samples_i]

Related Solutions

R – How to Draw Random Samples from a Non-Parametric Estimated Distribution

A kernel density estimate is a mixture distribution; for every observation, there's a kernel. If the kernel is a scaled density, this leads to a simple algorithm for sampling from the kernel density estimate:

repeat nsim times:
  sample (with replacement) a random observation from the data
  sample from the kernel, and add the previously sampled random observation

If (for example) you used a Gaussian kernel, your density estimate is a mixture of 100 normals, each centred at one of your sample points and all having standard deviation $h$ equal to the estimated bandwidth. To draw a sample you can just sample with replacement one of your sample points (say $x_i$) and then sample from a $N(\mu = x_i, \sigma = h)$. In R:

# Original distribution is exp(rate = 5)
N = 1000
x <- rexp(N, rate = 5)

hist(x, prob = TRUE)
lines(density(x))

# Store the bandwith of the estimated KDE
bw <- density(x)$bw

# Draw from the sample and then from the kernel
means <- sample(x, N, replace = TRUE)
hist(rnorm(N, mean = means, sd = bw), prob = TRUE)

Strictly speaking, given that the mixture's components are equally weighted, you could avoid the sampling with replacement part and simply draw a sample a size $M$ from each components of the mixture:

M = 10
hist(rnorm(N * M, mean = x, sd = bw))

If for some reason you can't draw from your kernel (ex. your kernel is not a density), you can try with importance sampling or MCMC. For example, using importance sampling:

# Draw from proposal distribution which is normal(mu, sd = 1)
sam <- rnorm(N, mean(x), 1)

# Weight the sample using ratio of target and proposal densities
w <- sapply(sam, function(input) sum(dnorm(input, mean = x, sd = bw)) / 
                                 dnorm(input, mean(x), 1))

# Resample according to the weights to obtain an un-weighted sample
finalSample <- sample(sam, N, replace = TRUE, prob = w)

hist(finalSample, prob = TRUE)

P.S. With my thanks to Glen_b who contributed to the answer.

Solved – Generate bivariate random numbers from joint distribution function

Assuming that the only thing that you have is an empirical distribution, the simplest way to go is to draw values from $\hat F$ and reject if $X_1 \leq c_1$ and $X_2 \leq c_2$. Less naive implementation would be to subset $(X_1, X_2)$ values so to drop the values below threshold and draw from $\hat F_\text{trunc}$, the same way as you would do with any other discrete distribution. Simple example in R of such approach can be find below.

set.seed(123)

c1 <- -1
c2 <- 0.5
X <- data.frame(X1 = rnorm(100), X2 = rnorm(100)) # creating fake data
X_trunc <- subset(X, X1 > c1 & X2 > c2) # subset

# draw 1000 samples
X_trunc[sample(nrow(X_trunc), 1000, replace = TRUE), ]

The above example assumes that you have the full data, however if the only thing that you have is the empirical distribution tables with probabilities for $(x_1,x_2)$, than the procedure is the same but you draw the $(x_1,x_2)$ pairs with $\hat F(x_1,x_2)$ probabilities as in the example below.

library(dplyr)

# lets calculate the probabilities for x1,x2 pairs
FX <- group_by(X, X1, X2) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  mutate(prob = n/sum(n))

# next, we subset and sample as above but from F(X1,X2)
FX_trunc <- subset(FX, X1 > c1 & X2 > c2)

# notice that here we sample with parameter prob set to F(x1,x2)
FX_trunc[sample(nrow(FX_trunc), 1000, replace = TRUE, prob = FX_trunc$prob), ]

Drawing from bivariate distribution does not differ in here from drawing from univariate distribution, the values to be drawn are pairs, or more precisely indexes for those pairs.

Best Answer

Related Solutions

R – How to Draw Random Samples from a Non-Parametric Estimated Distribution

Solved – Generate bivariate random numbers from joint distribution function

Related Question