Lay person’s explanation of Sheather-Jones method for bandwidth selection

density-estimation, kernel-smoothing

I'm currently writing my bioinformatics thesis. One part of what I'm doing involves performing a KDE on univariate data. I'm using R, and in the function documentation (for the density function) it says the Sheather-Jones bandwidth selection method is generally recommended. After trying a few different methods, I did notice that SJ gave the best results. I also saw online (including here on CrossValidated) that people usually say it's the recommended method.

Upon reviewing my thesis, one of my committee members asked me to briefly describe the method. I honestly have ZERO idea what it actually does. I tried reading their paper, but unfortunately neither the abstract nor the intro had any kind of simple lay person's explanation of what the method actually does, and I could not find much information online. I'm not one to easily "outsource" my problems to the online community; I usually try to figure everything out on my own, but this seems to be way out of my league and I really need some help here.

Does anyone have a very brief 1-2 sentence explanation of how SJ bandwidth selection works or why it's regarded as the most popular?
Thanks

Best Answer

Assuming you mean the preferred bandwidth estimator in Sheather and Jones' (1991) JRSS-B paper [1] (specifically, $\hat{h}_{2S}$), here's a brief discussion, as requested. Bear in mind that a brief discussion of a highly technical topic is necessarily a little vague and cryptic.

The basic issue of finding efficient$^\dagger$ bandwidth estimators boils down to finding a good estimate of $R(f'')$ (where $R(g) = \int g^2(x) dx$), the integrated squared second derivative of the density to be estimated -- i.e. the asymptotically optimal bandwidth depends on the second derivative of the very thing we wish to estimate!

$^\dagger$ here, specifically in the sense of minimum asymptotic mean integrated squared error (AMISE).
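
For reference, the standard AMISE-optimal bandwidth makes that dependence explicit. The symbols $K$ (the kernel), $n$ (the sample size) and $\mu_2(K)$ are notation introduced here, not taken from the text above:

$$h_{\text{AMISE}} = \left[\frac{R(K)}{\mu_2(K)^2 \, R(f'') \, n}\right]^{1/5}, \qquad \mu_2(K) = \int x^2 K(x) \, dx$$

Everything on the right-hand side is known except $R(f'')$, so choosing a bandwidth from data comes down to estimating that one quantity.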

Why does the integrated squared second derivative matter? In effect it measures how "wiggly" the curve is over the range you're looking at. If you have a very wiggly curve you won't get a good estimate of it with a wide bandwidth because you'll average over a bunch of wiggles instead of following them. If you have a curve that's pretty straight it makes sense to have a much wider bandwidth (since you can reduce the noise in your estimate by including more data).
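
As a rough numerical illustration (a sketch under stated assumptions, not something from the paper): for a normal density with standard deviation $\sigma$, $R(f'') = 3/(8\sqrt{\pi}\,\sigma^5)$, and with a Gaussian kernel the AMISE-optimal bandwidth works out to $(4/(3n))^{1/5}\sigma$. The function names and the sample size `n = 500` below are hypothetical.

```python
import math


def roughness_normal(sigma):
    """Closed-form R(f'') for a normal density with standard deviation sigma."""
    return 3.0 / (8.0 * math.sqrt(math.pi) * sigma ** 5)


def amise_bandwidth(roughness, n):
    """AMISE-optimal bandwidth for a Gaussian kernel.

    For the Gaussian kernel, R(K) = 1 / (2 * sqrt(pi)) and the kernel's
    second moment is 1, so h = (R(K) / (R(f'') * n)) ** (1/5).
    """
    r_kernel = 1.0 / (2.0 * math.sqrt(math.pi))
    return (r_kernel / (roughness * n)) ** 0.2


n = 500  # hypothetical sample size
for sigma in (0.5, 1.0, 2.0):
    r = roughness_normal(sigma)
    h = amise_bandwidth(r, n)
    # A narrower (more sharply curved) density has a larger R(f'')
    # and therefore a smaller optimal bandwidth.
    print(f"sigma={sigma}: R(f'')={r:.4f}, h={h:.4f}")
```

The narrower the density, the larger $R(f'')$ and the smaller the bandwidth, which is the "wiggly curves need narrow bandwidths" point in numbers.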

A number of bandwidth estimators use (in turn) a kernel based estimate of $R(f'')$.
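
To make "a kernel-based estimate of $R(f'')$" concrete, here is a minimal sketch. It relies on the identity $R(f'') = \int f^{(4)}(x) f(x)\,dx$ (integration by parts twice), so one can estimate the fourth derivative of the density with a Gaussian kernel and average it over the data. The function names, the simulated data and the pilot bandwidth `g = 0.55` are all assumptions for illustration; this is not the exact estimator or normalisation used in the paper.

```python
import math
import random


def phi4(u):
    """Fourth derivative of the standard normal density at u."""
    return (u ** 4 - 6.0 * u ** 2 + 3.0) * math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def roughness_estimate(xs, g):
    """Kernel estimate of R(f'') with a Gaussian kernel and pilot bandwidth g.

    Averages the kernel estimate of the fourth derivative of the density
    over all data points (the double sum includes the i == j terms).
    """
    n = len(xs)
    total = sum(phi4((xi - xj) / g) for xi in xs for xj in xs)
    return total / (n * n * g ** 5)


random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(300)]  # simulated N(0, 1) data
g = 0.55  # hypothetical pilot bandwidth, chosen by hand for this sketch

est = roughness_estimate(xs, g)
true_value = 3.0 / (8.0 * math.sqrt(math.pi))  # exact R(f'') for N(0, 1)
print(f"estimate={est:.4f}, true={true_value:.4f}")
```

Note that the estimate itself needs a bandwidth (`g` here), which is why the problem is circular and why the choice of that pilot bandwidth matters so much.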

Sheather and Jones include a bias term in their estimate of $R(f'')$ that had previously been neglected. This results in estimating $R(f''')$ (a lot of detail is being glossed over here).
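
Very roughly, and in notation introduced here rather than quoted from the paper (the constant $C$, the pilot bandwidth $g$, the kernel second moment $\mu_2(K) = \int x^2 K(x)\,dx$ and the hatted estimators are illustrative), the selected bandwidth solves an equation of the form

$$h = \left[\frac{R(K)}{\mu_2(K)^2 \, \hat{R}_{g(h)}(f'') \, n}\right]^{1/5}, \qquad g(h) = C \left[\frac{\hat{R}(f'')}{\hat{R}(f''')}\right]^{1/7} h^{5/7}$$

where $\hat{R}_{g}(f'')$ is a kernel estimate of $R(f'')$ computed with pilot bandwidth $g$. The pilot bandwidth is tied to $h$ itself, and the ratio inside $g(h)$ is where an estimate of $R(f''')$ enters.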

How to summarize all that? It's an improved version of a kernel-based estimate of the optimal bandwidth, not that this is likely to help much.

As to why it's popular (I won't engage in idle discussion of whether it's the most popular, since it seems impossible to reliably assess), the abstract gives a highly plausible reason:

"reliably good performance for smooth densities in simulations [...] second to none in the existing literature"

i.e. it (demonstrably) works well in practice for a reasonably broad set of cases.

[There have, unsurprisingly, been further suggested improvements in the last quarter century, but this bandwidth estimator remains popular.]

[1] S. J. Sheather and M. C. Jones (1991),
"A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation,"
Journal of the Royal Statistical Society, Series B, 53 (3), pp. 683-690.