How to interpret the bandwidth value in a kernel density estimation

density-estimation, kernel-smoothing

I'm not sure how to interpret the value of the bandwidth parameter in kernel density estimation. Let's say the values range from 1 to 20. How would I need to set the bandwidth so that each kernel spans an interval of width two? For example, if I place a kernel at the point 10, it should cover [9, 11]; if at 15, then [14, 16]. Would that simply be a bandwidth of 2?
The goal is to attach some meaning to the bandwidth.

Best Answer

For simplicity, let's assume that we are talking about a really simple kernel, say the triangular kernel:

$$ K(x) = \begin{cases} 1 - |x| & \text{if } x \in [-1, 1] \\ 0 & \text{otherwise} \end{cases} $$

Recall that in kernel density estimation, the density estimate $\hat f_h$ combines $n$ kernels, parametrized by $h$ and centered at the data points $x_i$:

$$ \hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h (x - x_i) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big) $$
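
As a side note on the notation, the second equality above relies on the scaled kernel $K_h(u) = \frac{1}{h} K\big(\frac{u}{h}\big)$. A short check, substituting $v = u/h$ (so $du = h\,dv$) and assuming $h > 0$, suggests why the factor $\frac{1}{h}$ is needed for each scaled kernel to remain a proper density:

$$ \int K_h(u)\,du = \int \frac{1}{h} K\Big(\frac{u}{h}\Big)\,du = \int K(v)\,dv = 1 $$

For the triangular kernel, the last integral is $\int_{-1}^{1} (1 - |v|)\,dv = 1$, so stretching the kernel by $h$ lowers its peak from $1$ to $\frac{1}{h}$ while keeping its area fixed.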

Notice that $\frac{x-x_i}{h}$ re-scales the difference between a point $x$ and the data point $x_i$ by the factor $h$. Most kernels (excluding the Gaussian) are limited to the $(-1, 1)$ range, so they return densities equal to zero for points outside the $(x_i-h, x_i+h)$ range. In other words, $h$ is a scale parameter for the kernel that changes its range from $(-1, 1)$ to $(-h, h)$.
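
To connect this with the example in the question, here is a minimal sketch that evaluates a single scaled triangular kernel centered at 10. The helper name `scaled_kernel` and the evaluation points are chosen only for illustration. Under the parametrization above, the kernel covers $(9, 11)$ when $h = 1$, while $h = 2$ spreads it over $(8, 12)$:

```r
# Triangular kernel with support [-1, 1], as defined above
K <- function(u) ifelse(abs(u) <= 1, 1 - abs(u), 0)

# One kernel centered at `centre` with bandwidth h
# (scaled_kernel is a hypothetical helper name, not part of any package)
scaled_kernel <- function(x, centre, h) K((x - centre) / h) / h

pts <- c(9, 9.5, 10, 10.5, 11, 11.5)

scaled_kernel(pts, centre = 10, h = 1)
# 0.000 0.500 1.000 0.500 0.000 0.000   -> zero at and beyond 9 and 11

scaled_kernel(pts, centre = 10, h = 2)
# 0.250 0.375 0.500 0.375 0.250 0.125   -> still positive at 9, 11 and 11.5
```

So for a kernel supported on $(-1, 1)$, $h$ is the half-width of each kernel rather than its full width.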

This is illustrated in the plot below, where $n=7$ points are used for estimating kernel densities with different bandwidths $h$ (colored points on top mark the individual values, colored lines are the kernels, the gray line is the overall kernel estimate). As you can see, $h < 1$ makes the kernels narrower, while $h > 1$ makes them wider. Changing $h$ influences both the individual kernels and the final kernel density estimate, since the estimate is a mixture distribution of the individual kernels. A higher $h$ makes the kernel density estimate smoother, while a smaller $h$ concentrates the kernels more tightly around the individual data points, and with $h \rightarrow 0$ you would end up with just a bunch of Dirac delta functions centered at the points $x_i$.

Examples of KDE estimates with h=0.5, h=1, h=1.5, h=2
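
As a rough numerical companion to the plot, the sketch below checks two things for the same four bandwidths: the estimate should keep a total area of about 1 whatever $h$ is, and its peaks should get lower as $h$ grows. The sample in `data` is made up for illustration (it is not the sample used for the plot), and the area is only a Riemann-sum approximation on a grid:

```r
# Triangular kernel and KDE, following the definitions above
K <- function(u) ifelse(abs(u) <= 1, 1 - abs(u), 0)
kde <- function(x, data, h, K) {
  rowSums(outer(x, data, function(xi, yi) K((xi - yi) / h))) / (length(data) * h)
}

data <- c(-1.7, -0.7, 0.2, 0.4, 1.4, 4.7, 5.2)  # made-up sample, illustration only
dx   <- 0.001
grid <- seq(-8, 8, by = dx)                     # wide enough to cover every kernel
hs   <- c(0.5, 1, 1.5, 2)

# Approximate area under the estimate and its highest value for each bandwidth
areas    <- sapply(hs, function(h) sum(kde(grid, data, h, K)) * dx)
max_dens <- sapply(hs, function(h) max(kde(grid, data, h, K)))

print(data.frame(h = hs, area = round(areas, 4), max_density = round(max_dens, 4)))
```

For this sample the area stays at 1 up to numerical error, while the maximum density drops as $h$ increases, which is one way to see the smoothing effect described above.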

And the R code that produced the plots:

set.seed(123)
n <- 7
x <- rnorm(n, sd = 3)

# Triangular kernel with support [-1, 1]
K <- function(x) ifelse(x >= -1 & x <= 1, 1 - abs(x), 0)

# Kernel density estimate at the points x, for sample `data` and bandwidth h
kde <- function(x, data, h, K) {
  n <- length(data)
  out <- outer(x, data, function(xi, yi) K((xi - yi) / h))
  rowSums(out) / (n * h)
}

xx <- seq(-8, 8, by = 0.001)
cols <- rainbow(n)
for (h in c(0.5, 1, 1.5, 2)) {
  plot(NA, xlim = c(-4, 8), ylim = c(0, 0.5), xlab = "", ylab = "",
       main = paste0("h = ", h))
  for (i in 1:n) {
    # Scale each kernel by 1/(n*h) so the colored curves sum to the gray one
    lines(xx, K((xx - x[i]) / h) / (n * h), type = "l", col = cols[i])
    rug(x[i], lwd = 2, col = cols[i], side = 3, ticksize = 0.075)
  }
  lines(xx, kde(xx, x, h, K), col = "darkgray")
}
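
One practical caveat when moving from the hand-written `kde()` above to R's built-in `density()`: there, `bw` is documented as the standard deviation of the smoothing kernel, not its half-width. A triangular kernel on $(-h, h)$ has standard deviation $h/\sqrt{6}$, so the sketch below passes `bw = h / sqrt(6)` to get kernels that span roughly $(x_i - h, x_i + h)$. The two data points are taken from the question's example, and `density()` works on a binned grid, so the values are approximate:

```r
h <- 1                       # desired half-width of each kernel
pts <- c(10, 15)             # kernel centers from the question's example

# bw is the kernel's standard deviation; for a triangular kernel sd = h / sqrt(6)
d <- density(pts, bw = h / sqrt(6), kernel = "triangular",
             from = 8, to = 17, n = 1024)

# Interpolate the estimate at a few points of interest
at <- c(9, 10, 11, 12.5, 14, 15, 16)
round(approx(d$x, d$y, xout = at)$y, 3)
# Expect roughly 1/(2*h) = 0.5 at the two centers and about 0 elsewhere
```

With this conversion, the estimate should be (approximately) zero outside $[9, 11]$ and $[14, 16]$, matching the intervals asked about in the question.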

For more details you can check the great introductory books by Silverman (1986) and Wand & Jones (1995).


Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman & Hall/CRC.

Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. London: Chapman & Hall/CRC.