Solved – Why does the definition of the kernel include the bandwidth

density-estimation, kernel-smoothing, r

I noticed that the density() function in R, which computes a kernel density estimate, scales the kernel so that the bandwidth bw given in the argument is the standard deviation of the selected kernel (e.g. "gaussian", "epanechnikov", ...). For easy reference, below is the definition given in R:

bw – the smoothing bandwidth to be used. The kernels are scaled such that this is the standard deviation of the smoothing kernel. (Note this differs from the reference books cited below, and from S-PLUS.)

This means that the Gaussian kernel is defined as K = dnorm(x, sd = bw) instead of the standard normal density, i.e. K = dnorm(x). For the Epanechnikov kernel, the definition in the density() function is K(u) = (3/4)(1 - (u/a)^2)/a for |u| < a, with a = bw*sqrt(5), instead of the more common definition found in reference books, i.e. K(u) = (3/4)(1 - u^2) I(|u| ≤ 1).
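For example, here is a quick numerical check (my own sketch, not taken from the R sources) that density() with the Gaussian kernel indeed behaves like an average of dnorm(., sd = bw) terms:

```r
# Sketch: compare density()'s Gaussian-kernel output with a manual KDE
# built from dnorm(., sd = bw); they should agree closely (density() uses
# binning/FFT internally, so the match is approximate, not exact).
set.seed(1)
x  <- rnorm(100)
bw <- 0.3

d <- density(x, bw = bw, kernel = "gaussian")

# Manual kernel density estimate evaluated at the same grid points
manual <- sapply(d$x, function(t) mean(dnorm(t - x, sd = bw)))

max(abs(d$y - manual))  # should be small
```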

Does anyone know why the kernel is scaled in such a way that the bandwidth is the standard deviation of the specified kernel in density()?

Does such scaling lead to better estimates?

Hope someone can shed some light on this. Thank you in advance!

Best Answer

TL;DR: you ask about two things: (1) why we parametrize the kernels by their standard deviation, and (2) why we need a bandwidth parameter at all. The answer to the first question is simple: we use the standard deviation because it makes things simpler, and it does not matter for performance. As for the second question, we need a bandwidth parameter (no matter whether it is the standard deviation or something else) to adjust the kernel density estimate to the data, and changing it does affect performance.

As noted by Glen_b, the answer to the first question is discussed in the documentation of ?density (notation slightly changed by me):

The statistical properties of a kernel are determined by $\sigma^2_K = \int t^2 \, K(t) \, dt$ which is always = 1 for our kernels (and hence the bandwidth bw is the standard deviation of the kernel) and $R(K) = \int \, K(t)^2 \, dt$. MSE-equivalent bandwidths (for different kernels) are proportional to $\sigma_K R(K)$ which is scale invariant and for our kernels equal to $R(K)$.

Saying this in plain English: different kernels, in their "raw" form, have different standard deviations $\sigma_K$. When we choose different kernels, we want them to be on the same scale. It is convenient to have a single bw parameter that has the same meaning for all kernels, namely the standard deviation of the kernel. This is easily achieved by making the kernels have standard deviation equal to 1 "by default" and then re-scaling them with the bandwidth parameter $h$ by taking $K_h(x) = K(x/h)/h$ (cf. the location-scale family of distributions).
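To make this concrete, here is a small sketch (my own illustration, not taken from the R sources) using the Epanechnikov kernel:

```r
# The textbook Epanechnikov kernel on [-1, 1] has variance 1/5, so dividing
# u by sqrt(5) gives a unit-variance kernel, and K_h(x) = K(x/h)/h then
# makes the bandwidth h its standard deviation (as in density()).
epan_textbook <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)

# Unit-variance version: support is |u| < sqrt(5)
epan_unit <- function(u) epan_textbook(u / sqrt(5)) / sqrt(5)

# Bandwidth-h version: standard deviation equals h
epan_h <- function(u, h) epan_unit(u / h) / h

# Check the variances numerically
integrate(function(u) u^2 * epan_unit(u), -sqrt(5), sqrt(5))$value            # ~ 1
integrate(function(u) u^2 * epan_h(u, 0.5), -0.5 * sqrt(5), 0.5 * sqrt(5))$value  # ~ 0.25 = 0.5^2
```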

Moreover, notice (again, as noted by Glen_b) that most of the criteria for comparing the performance of different kernels take into consideration the roughness of the kernel $R(K)$ (see above) and its standard deviation $\sigma_K$, so if we make $\sigma_K$ the same for every kernel, this simplifies a lot of things. If you check handbooks on kernel density estimation, you will notice that many equations become much simpler when $\sigma_K = 1$.
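For concreteness, here is the standard asymptotic MISE expansion (a textbook result, not part of the quoted documentation) showing where both quantities enter:

$$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{h^4 \, \sigma_K^4 \, R(f'')}{4}, \qquad h_{\mathrm{opt}} = \left[\frac{R(K)}{\sigma_K^4 \, R(f'')\, n}\right]^{1/5}.$$

With $\sigma_K = 1$, the kernel-dependent part reduces to $R(K)$ alone, which is exactly the simplification the quoted documentation refers to.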

So it doesn't matter for performance whether you parametrize your kernel by its standard deviation or by some other scale parameter, but using the standard deviation makes life easier.

As for your second question:

This means that the Gaussian kernel is defined as K = dnorm(x, sd = bw) instead of the standard normal density, i.e. K = dnorm(x). For the Epanechnikov kernel, the definition in the density() function is K(u) = (3/4)(1 - (u/a)^2)/a for |u| < a, with a = bw*sqrt(5), instead of the more common definition found in reference books, i.e. K(u) = (3/4)(1 - u^2) I(|u| ≤ 1).

We want to be able to adjust our kernels to the data. When using histograms we want to be able to change the width of the bins so that they are flexible (imagine a histogram with bins in centimeters while plotting geographical data in kilometers -- you would end up with a histogram with thousands of bins if you weren't able to change their width!). In the same way, we want to be able to change the bandwidth (i.e. the scale) of the kernels, since it controls their "width".

If the bandwidth is too small, the kernel density estimate overfits the data; if it is too large, it underfits (try manipulating this parameter to see how it affects the results). So yes, changing the bandwidth may lead to improvements in performance, since it lets you fit a kernel density that matches your data better. To learn more, see the How to interpret the bandwidth value in a kernel density estimation? thread, which discusses this in more detail.
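Here is a quick sketch (my own illustration) of that effect with density() in R:

```r
# Compare an undersmoothed, a default, and an oversmoothed estimate
# on the same bimodal sample to see over- and underfitting.
set.seed(42)
x <- c(rnorm(200, mean = -2), rnorm(200, mean = 2))

plot(density(x, bw = 0.05), main = "bw = 0.05 (too small: very wiggly, overfits)")
plot(density(x),            main = "default bw (bw.nrd0: usually reasonable)")
plot(density(x, bw = 2),    main = "bw = 2 (too large: modes merged, underfits)")
```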