Estimating PDF of continuous distribution from (few) data points

data visualization, density function

What are some good, established methods for estimating the probability density function (denoted $f(x)$ from here on) of a continuous distribution, given a sample of points $x_1, \ldots, x_n$ drawn from it? I primarily need the PDF for plotting purposes.

The naive approach would be using a histogram, i.e. counting how many points fall into different $[a,b)$ intervals. But this has several problems:

  • It doesn't give us $f(\frac{a+b}{2})$ but $\int_a^b f(x) \, dx$, which is not the same and can look qualitatively different on a plot. For example, for a Pareto distribution it gives an estimate of the PDF that is not a straight line on a log-log scale.
  • It heavily depends on binning, requiring a careful choice of bin size.
  • Depending on the distribution, it may require a manual choice of non-uniform bin sizes to get something reasonable-looking (e.g. a Pareto distribution requires bins of increasing width).
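
To make the binning issues above concrete, here is a minimal R sketch, not a definitive recipe. It assumes synthetic Pareto data (alpha = 2, x_min = 1 and n = 500 are made-up values for illustration) and uses logarithmically spaced bins. The density component returned by hist() divides each count by n times the bin width, so each height estimates the average of $f$ over the bin rather than $f$ at its midpoint.

```r
set.seed(1)

# Assumed example data: Pareto(alpha = 2, x_min = 1) via inverse-CDF sampling
alpha <- 2
x_min <- 1
x <- x_min * runif(500)^(-1 / alpha)

# Logarithmically spaced breaks, padded slightly so they cover all of x
breaks <- exp(seq(log(min(x) * 0.999), log(max(x) * 1.001), length.out = 16))

# h$density is count / (n * bin width), i.e. the bin average of the PDF
h <- hist(x, breaks = breaks, plot = FALSE)

# Geometric bin midpoints suit a logarithmic x axis
mid <- sqrt(breaks[-1] * breaks[-length(breaks)])

# Empty bins cannot be drawn on a log scale, so drop them
keep <- h$counts > 0
plot(mid[keep], h$density[keep], log = "xy",
     xlab = "x", ylab = "estimated density")

# True PDF for comparison: a straight line on log-log axes
curve(alpha * x_min^alpha / x^(alpha + 1), add = TRUE, col = "red")
```

The number of bins (15 here) is still a manual choice, which is exactly the drawback described in the list.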

I am mainly interested in established methods (please note that I'm not a statistician and have no formal training in this, so I may not know about the obvious!), but any ideas are welcome too. E.g., would it work to estimate the CDF by sorting the points and then somehow take the derivative? But that turns the problem into estimating the derivative of noisy data, which is again a difficult problem.
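
As a rough illustration of the sorting idea, the R sketch below builds the empirical complementary CDF directly from the sorted sample and plots it on log-log axes, with no binning and no differentiation. The data are an assumption (synthetic Pareto with alpha = 2 and x_min = 1); for such a distribution the complementary CDF is itself a straight line on a log-log scale, so this may be enough for a visual check, although it shows the CDF rather than the PDF.

```r
set.seed(1)

# Assumed example data: Pareto(alpha = 2, x_min = 1) via inverse-CDF sampling
x <- runif(500)^(-1 / 2)

xs <- sort(x)
n <- length(xs)

# Empirical complementary CDF: fraction of points >= xs[i]
ccdf <- (n:1) / n

plot(xs, ccdf, log = "xy", type = "s",
     xlab = "x", ylab = "P(X >= x)")
```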

I need this mainly not for fitting the PDF to some function, but for visualizing it.

EDIT: I am in particular interested in techniques that work well for long-tail distributions.

Best Answer

What you are looking for is kernel density estimation. An internet search for the term turns up numerous hits, and the Wikipedia article on it is a good place to start. If you have R at your disposal, the function density() provides what you need:

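# Draw a density-scaled histogram of x and overlay a kernel density estimate.
histAndDensity <- function(x, ...) {
  # freq = FALSE scales the bars as densities, comparable to the curve
  retval <- hist(x, freq = FALSE, ...)
  # Overlay the kernel density estimate in red
  lines(density(x, na.rm = TRUE), col = "red")
  # Return the histogram object without printing it
  invisible(retval)
}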
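
For the long-tailed case mentioned in the question's edit, one possible variation (a sketch under stated assumptions, not part of the answer's original code) is to run density() on log(x) and transform back with the change of variables $f_X(x) = f_Y(\log x) / x$, where $Y = \log X$. This assumes strictly positive data; the sample below is synthetic Pareto with made-up parameters alpha = 2 and x_min = 1.

```r
set.seed(1)

# Assumed example data: Pareto(alpha = 2, x_min = 1) via inverse-CDF sampling
x <- runif(500)^(-1 / 2)

# Kernel density estimate of Y = log(X), using density() defaults
d <- density(log(x))

# Back-transform to the original scale: f_X(x) = f_Y(log x) / x
gx <- exp(d$x)
fx <- d$y / gx

# Guard against zeros before plotting on log-log axes
keep <- fx > 0
plot(gx[keep], fx[keep], log = "xy", type = "l",
     xlab = "x", ylab = "estimated density")
```

The estimate is smoothed across the lower bound of the data, so it will be biased near x_min; the default bandwidth is used here and may need tuning.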