Solved – Sample from distribution given by histogram

density function, histogram, r, sampling

Given a histogram built from a set of data points, how do I randomly sample from the distribution that the histogram describes?

Any conceptual comment / R code would be welcome.

Best Answer

Since sampling from a kernel density estimate has been covered once or twice already, I'll focus on sampling from a histogram treated as the population pdf.

The idea is simply

For each observation in the new sample

  1. choose a histogram bin according to the proportions of 
     the original sample (treated as a discrete pmf)

  2. sample uniformly from that bin-interval

For example in R:

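# create an original histogram
x = rgamma(200, 4)
xhist = hist(x, freq=FALSE)

# sample from it
samplesize = 400
# choose a bin with probability proportional to its count
# (using density here is only correct when all bins have equal width)
bins = with(xhist, sample(length(mids), samplesize, prob=counts, replace=TRUE))
# sample uniformly within the chosen bin
result = runif(length(bins), xhist$breaks[bins], xhist$breaks[bins+1])
# overlay the histogram of the new sample on the original plot
hist(result, freq=FALSE, add=TRUE, border=3)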
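The two steps above can be wrapped in a small helper. This is only a minimal sketch: the name sample_from_hist is hypothetical (not part of R or of the original answer), and the quantile-based breaks are chosen just to show that drawing bins in proportion to counts also works for unequal bin widths.

```r
# Hypothetical helper (an illustration, not from the original answer):
# draw n values from the piecewise-uniform distribution of a "histogram" object.
sample_from_hist <- function(h, n) {
  # step 1: pick bins with probability proportional to their counts
  bins <- sample(length(h$counts), n, replace = TRUE, prob = h$counts)
  # step 2: draw uniformly within each chosen bin
  runif(n, min = h$breaks[bins], max = h$breaks[bins + 1])
}

set.seed(1)
x <- rgamma(200, 4)
# deliberately unequal bin widths (assumed breaks, for illustration only)
brk <- c(0, quantile(x, c(0.1, 0.3, 0.6, 0.9), names = FALSE), max(x) + 1)
h <- hist(x, breaks = brk, plot = FALSE)
new_sample <- sample_from_hist(h, 10000)

# bin proportions of the new sample should be close to the original ones
h_new <- hist(new_sample, breaks = h$breaks, plot = FALSE)
p_orig <- h$counts / sum(h$counts)
p_new <- h_new$counts / sum(h_new$counts)
print(round(rbind(original = p_orig, resampled = p_new), 3))
```

Comparing the two rows of bin proportions is a quick sanity check that the resampled values follow the histogram rather than the raw data.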

Just for completeness (since sampling from the kernel density estimate* is very simple):

repeat nsim times:
  sample (with replacement) a random observation from the data
  sample from the kernel, and add the previously sampled random observation

* note that some kernels - like fourth order kernels - are not densities and this assumes that the kernel is a density

In R, for a Gaussian kernel and bandwidth h, with data in x:

 rnorm(nsim, mean=sample(x,nsim,replace=TRUE), sd=h)
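The resample-then-add-kernel-noise recipe can be checked numerically. The sketch below assumes a Gaussian kernel and picks the bandwidth with bw.nrd0(), the default rule used by density(); that bandwidth choice is an assumption made here, not something the answer prescribes. A Gaussian kernel density estimate is an equal-weight mixture of normals centred on the data, so its mean equals the sample mean and its variance is (n-1)/n * var(x) + h^2.

```r
set.seed(1)
x <- rgamma(200, 4)
h <- bw.nrd0(x)   # assumed bandwidth choice (default rule of density())
nsim <- 10000

# resample the data with replacement, then add Gaussian kernel noise with sd = h
kde_sample <- sample(x, nsim, replace = TRUE) + rnorm(nsim, mean = 0, sd = h)

# compare simulated moments with the theoretical moments of the estimate
n <- length(x)
var_kde <- (n - 1) / n * var(x) + h^2
print(c(mean_data = mean(x), mean_sim = mean(kde_sample)))
print(c(var_kde = var_kde, var_sim = var(kde_sample)))
```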

Related Solutions

Solved – Nonparametric expected value estimation of sample from unknown distribution

It is quite simple; you make a subsample by sampling with replacement:

sample(x,replace=T)

calculate the statistic you want on it:

mean(sample(x,replace=T))

finally average it over many repetitions:

mean(replicate(1000,mean(sample(x,replace=T))))
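As a small illustration of these steps, the sketch below keeps the individual bootstrap means instead of averaging them straight away. The data in x are simulated stand-ins (an assumption for the example), and looking at the spread of the bootstrap means is an extra step beyond what the answer states.

```r
set.seed(1)
x <- rexp(50)   # stand-in data; any numeric vector would do

# 1000 bootstrap replicates of the sample mean
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

print(mean(boot_means))  # average over repetitions, close to mean(x)
print(sd(boot_means))    # spread of the replicates: a bootstrap standard error
```

Keeping the replicates is optional, but it makes it possible to gauge how variable the estimate is, not just its value.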

Solved – Estimating PDF of continuous distribution from (few) data points

What you are looking for is kernel density estimation. An internet search for these terms should turn up numerous hits, and there is a Wikipedia article on it to get you started. If you have R at your disposal, the function density provides what you need:

histAndDensity<-function(x, ...)
{
  retval<-hist(x, freq=FALSE, ...)
  lines(density(x, na.rm=TRUE), col="red")
  invisible(retval)
}
