Descriptive Statistics – Choosing Bandwidth for Kernel Density Estimation

Tags: density-estimation, descriptive statistics, kernel-smoothing

Are there any heuristics for selecting the bandwidth for kernel density estimation? In other words, is a spiky curve better or a smooth one?

Best Answer

I assume your sample is from a continuous distribution. Then the situation is somewhat like it is for making histograms. If you have enough data, you want to choose the width of histogram bars to be just thin enough to suggest the shape of the population density curve without the 'raggedy' look that comes from neighboring bars of very different heights (maybe interspersed with some empty bins).

If the population density is multi-modal (as for a mixture of several distributions), then you don't want bars so thick that you can't see the individual modes. [The link, provided by @whuber while I was writing this answer, pays particular attention to detecting modes.]

Using the kernel density estimator (KDE) implemented by the density function in R, I have found that the default bandwidth is usually about right. Sometimes I halve it and sometimes I double it, depending on the task at hand.
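To make "the default" and "halving or doubling" concrete, here is a minimal sketch. It assumes the default selector used by density is bw.nrd0 (the rule of thumb described in ?bw.nrd0) and uses the adjust argument as a multiplier on that bandwidth. The sample y and the object names are made up for illustration.

```r
set.seed(101)                      # arbitrary seed, for reproducibility only
y <- rnorm(1000, 100, 15)          # illustrative sample, not one from this answer

# Rule-of-thumb bandwidth computed by hand ...
h_manual  <- 0.9 * min(sd(y), IQR(y) / 1.34) * length(y)^(-1/5)
# ... which should agree with R's default selector
h_default <- bw.nrd0(y)

d_default <- density(y)                # default bandwidth
d_half    <- density(y, adjust = 0.5)  # half the default: spikier curve
d_double  <- density(y, adjust = 2)    # twice the default: smoother curve

# Bandwidths actually used by each estimate
c(manual = h_manual, default = h_default,
  half = d_half$bw, double = d_double$bw)
```

Plotting d_half, d_default, and d_double on the same axes is a quick way to judge whether the default is too spiky or too smooth for the task at hand.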

Sampling from a normal distribution, you may need a sample as large as a thousand to get a useful density estimator.

set.seed(2921)
x1 = rnorm(500, 100, 15)
x2 = rnorm(1000, 100, 15)
x3 = rnorm(5000, 100, 15)

The three panels below compare R's default histogram bins, KDEs with the default bandwidth (dotted brown), and the population density (blue) for samples of sizes 500, 1000, and 5000. (Maybe the histogram at the right could use thinner bars, but the default KDE bandwidth seems about right.)

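[Figure: three panels showing histograms with the default KDE (dotted brown) and the normal population density (blue) for n = 500, 1000, 5000]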

R code for figure:

# Three panels side by side: n = 500, 1000, 5000
par(mfrow=c(1,3))
hist(x1, prob=TRUE, ylim=c(0,.03), col="skyblue2")
 curve(dnorm(x, 100, 15), add=TRUE, lwd=2, col="blue")   # population density
 lines(density(x1), lwd=3, lty="dotted", col="brown")    # KDE, default bandwidth
hist(x2, prob=TRUE, ylim=c(0,.03), col="skyblue2")
 curve(dnorm(x, 100, 15), add=TRUE, lwd=2, col="blue")
 lines(density(x2), lwd=3, lty="dotted", col="brown")
hist(x3, prob=TRUE, ylim=c(0,.03), col="skyblue2")
 curve(dnorm(x, 100, 15), add=TRUE, lwd=2, col="blue")
 lines(density(x3), lwd=3, lty="dotted", col="brown")
par(mfrow=c(1,1))   # restore the single-panel layout
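One reason the default keeps looking reasonable as the sample grows is that the rule-of-thumb bandwidth shrinks with the sample size, roughly in proportion to n^(-1/5). The sketch below prints the default bandwidth for fresh normal samples of the three sizes used above. The names sizes and h are illustrative, and the draws are new ones rather than x1, x2, x3.

```r
set.seed(2921)
sizes <- c(500, 1000, 5000)

# Default (bw.nrd0) bandwidth for a fresh normal sample of each size
h <- sapply(sizes, function(n) bw.nrd0(rnorm(n, 100, 15)))
names(h) <- sizes
h

# Shrinkage predicted by the n^(-1/5) rate, relative to n = 500
(sizes / 500)^(-1/5)
```

The observed ratios will only roughly match the predicted ones, because the sample standard deviation and IQR also vary from sample to sample.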

Now we look at a bimodal population: a 50:50 mixture (exactly 50:50, for simplicity) of two normal distributions whose means are far enough apart to give two distinct modes.

set.seed(2921)
x1 = c(rnorm(250, 80, 15), rnorm(250, 120, 15))
x2 = c(rnorm(500, 80, 15), rnorm(500, 120, 15))
x3 = c(rnorm(2500, 80, 15), rnorm(2500, 120, 15))

Here again, we use defaults in R. The histogram at the left should probably have fewer bars, but the KDE bandwidth seems OK. Often, the KDE bandwidth is less 'fussy' than the width of histogram bars.

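[Figure: three panels showing histograms with the default KDE (dotted brown) and the bimodal population density (blue) for n = 500, 1000, 5000]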

# Same layout as before, now with the 50:50 mixture density in blue
par(mfrow=c(1,3))
hist(x1, prob=TRUE, ylim=c(0,.02), col="skyblue2")
 curve(.5*dnorm(x,80,15)+.5*dnorm(x,120,15), add=TRUE, lwd=2, col="blue")
 lines(density(x1), lwd=3, lty="dotted", col="brown")   # KDE, default bandwidth
hist(x2, prob=TRUE, ylim=c(0,.02), col="skyblue2")
 curve(.5*dnorm(x,80,15)+.5*dnorm(x,120,15), add=TRUE, lwd=2, col="blue")
 lines(density(x2), lwd=3, lty="dotted", col="brown")
hist(x3, prob=TRUE, ylim=c(0,.02), col="skyblue2")
 curve(.5*dnorm(x,80,15)+.5*dnorm(x,120,15), add=TRUE, lwd=2, col="blue")
 lines(density(x3), lwd=3, lty="dotted", col="brown")
par(mfrow=c(1,1))   # restore the single-panel layout
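To see how the bandwidth affects the number of visible modes, one rough check is to count the local maxima of the KDE for several multiples of the default bandwidth. This is only a sketch: count_modes is a hypothetical helper written for this illustration, it counts maxima on the grid returned by density, and the sample z is a new draw from the same mixture.

```r
set.seed(2022)
z <- c(rnorm(500, 80, 15), rnorm(500, 120, 15))   # same mixture, new draws

# Count local maxima of the KDE on density()'s evaluation grid
count_modes <- function(x, adjust) {
  d <- density(x, adjust = adjust)
  # a slope sign change from +1 to -1 marks a local maximum
  sum(diff(sign(diff(d$y))) == -2)
}

factors <- c(0.1, 0.5, 1, 2, 5)   # multiples of the default bandwidth
modes   <- sapply(factors, function(a) count_modes(z, a))
data.frame(adjust = factors, modes = modes)
```

A very small multiple should show many spurious bumps (the spiky curve), while a very large one tends to merge the two real modes into one (the oversmoothed curve). The useful range lies in between.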

Note: If the population density has restricted support (for example, $[0,\infty)$ for gamma distributions or $[0,1]$ for beta distributions), the KDEs in R are often positive just outside the region of support.
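The sketch below illustrates this boundary effect for a gamma sample and shows one common remedy, reflection about the boundary. Reflection is my choice for illustration, not something the default density call does, and the names g, d_ref, and y_ref are made up. Note that passing from = 0 by itself only truncates the evaluation grid; it does not put the lost mass back.

```r
set.seed(303)
g <- rgamma(1000, shape = 2, rate = 1)   # support is [0, Inf)

d    <- density(g)                       # default KDE ignores the boundary
step <- diff(d$x[1:2])
mass_below_zero <- sum(d$y[d$x < 0]) * step   # Riemann-sum approximation

# Reflection: mirror the sample about 0, estimate with the same bandwidth,
# keep only x >= 0, and double the result so it integrates to about 1
d_ref <- density(c(g, -g), bw = d$bw, from = 0, to = max(g) + 3 * d$bw)
y_ref <- 2 * d_ref$y
mass_ref <- sum(y_ref) * diff(d_ref$x[1:2])   # should be close to 1

c(below_zero = mass_below_zero, reflected_total = mass_ref)
```

To plot the corrected estimate, use lines(d_ref$x, y_ref). For densities that are zero at the boundary, as here, estimating on a transformed scale such as log(g) is another option, which I have not shown.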