Kernel Smoothing – Optimal Methods for Bandwidth Selection in Kernel Density Estimation Using R

kernel-smoothing, r

I am looking for help in choosing a suitable method for bandwidth selection in kernel density estimation.

I have six data sets with 50 to 200 observations each and want to fit a continuous univariate pdf to each of them (parametric pdfs do not provide a good fit). I have come across a paper that compares various R packages for KDE, such as "density" in stats, "KernSmooth", "ks", etc. (see Deng & Wickham 2011).

My understanding so far:
I understand that the main challenge in KDE is choosing an optimal bandwidth (and the kernel). Furthermore, I understand that bandwidth selection is a bias-variance trade-off, since we want to avoid overfitting. As with other learning techniques, I would want to minimize the error on my test set, correct? And for densities, the usual error measures would be MISE or AMISE?

From my reading, I have found a variety of methods, which I concluded boil down to the following:
1. Choosing the bandwidth by visual fit (which may help but may be arbitrary)
2. Using plug-ins
3. Using cross-validation

All of the above packages implement one or more of these techniques for choosing an optimal bandwidth (e.g. Wand & Jones 1995, Sheather & Jones 1991, Bowman & Azzalini, etc.). Also, I suppose that the choice of an appropriate method will depend on the particular data set.

I have three questions:
1) Is my understanding of the topic so far correct (third paragraph)?

2) What is the basic idea of using plug-ins?
As far as I understand, these are approximations of an optimal bandwidth under some specific statistical assumptions, and they are necessary because the true distribution f is not known. Is this correct?

3) Is there any agreement about "better" methods and thus R packages (among the suggested ones) to use for bandwidth selection?
Essentially I want to avoid using outdated methods and would be glad if someone could point me to the state of the art in that direction.

Any help in the form of opinions, corrections, or further references is highly appreciated.

Best Answer

1) If you are just looking for a relatively new bandwidth selection method that is well accepted in academia (at least judging by the number of citations on Google Scholar), you can try KDE via diffusion by Botev et al. (2010). It is available in the provenance package in R.

2) In general, the solve-the-equation plug-in method has often been used as a benchmark for bandwidth selection since the article by:

Jones, M. C., Marron, J. S., & Sheather, S. J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91(433), 401-407.
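As a rough illustration of 2): plug-in selectors replace the unknown quantities in the asymptotically optimal bandwidth with estimates obtained from the data. Base R's stats package provides the Sheather-Jones selector as `bw.SJ()`, where `method = "ste"` is the solve-the-equation variant and `method = "dpi"` the direct plug-in variant. The simulated sample below is only an assumption for the sake of the example, not the asker's data.

```r
set.seed(1)

# Hypothetical sample: a bimodal mixture with 150 observations,
# standing in for one of the real data sets.
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(50, mean = 4, sd = 0.7))

# Sheather-Jones bandwidths: solve-the-equation and direct plug-in
h_ste <- bw.SJ(x, method = "ste")
h_dpi <- bw.SJ(x, method = "dpi")

# Gaussian kernel density estimate using the solve-the-equation bandwidth
fit <- density(x, bw = h_ste, kernel = "gaussian")

print(c(ste = h_ste, dpi = h_dpi))
```

Calling `density(x, bw = "SJ-ste")` should give the same estimate without computing the bandwidth separately.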

Both 1) and 2) above are for the univariate case. However, the best method for you depends less on how new and fancy it is than on what kind of data you have and what your goals are. For example, MLCV (maximum likelihood cross-validation) often provides oversmoothed estimates; however, if you are looking for smooth estimates of the tails of your density, you might want to consider such a method. If you are just exploring your data, a rule-of-thumb method, or perhaps even a histogram, may be sufficient. Finally, in the univariate case, selecting a bandwidth (or even a series of bandwidths) is not as problematic as in the multivariate case, where the difficulty grows with the dimension of the density. For reference, if interested, see:

Sain, S. R. (2002). Multivariate locally adaptive density estimation. Computational Statistics & Data Analysis, 39(2), 165-186.
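To get a feeling for how much the selectors discussed above can differ on one sample, here is a minimal sketch using only base R. `bw.nrd0()` is the rule of thumb, `bw.ucv()` is unbiased (least-squares) cross-validation, and `bw.SJ()` is Sheather-Jones. Base R has no MLCV selector, so `loo_loglik()` is a hand-written helper (hypothetical, Gaussian kernel assumed) that maximizes the leave-one-out log-likelihood. The simulated data and the search interval are assumptions as well.

```r
set.seed(42)

# Hypothetical bimodal sample with 180 observations
x <- c(rnorm(120, mean = 0, sd = 1), rnorm(60, mean = 3, sd = 0.5))
n <- length(x)

# Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h:
# each point is scored by the density estimated from the other n - 1 points.
loo_loglik <- function(h) {
  k <- dnorm(outer(x, x, "-") / h)  # kernel weights for all pairs
  diag(k) <- 0                      # leave the point itself out
  sum(log(rowSums(k) / ((n - 1) * h)))
}

# MLCV bandwidth: maximize the criterion over an assumed search interval
h_mlcv <- optimize(loo_loglik,
                   interval = c(0.05, 2) * sd(x),
                   maximum = TRUE)$maximum

bws <- c(rule_of_thumb = bw.nrd0(x),
         ucv           = bw.ucv(x),
         sj_ste        = bw.SJ(x, method = "ste"),
         mlcv          = h_mlcv)
print(round(bws, 3))

# Overlay the resulting estimates for a visual comparison
plot(density(x, bw = bws[["sj_ste"]]), main = "Bandwidth selectors compared")
for (i in seq_along(bws)) lines(density(x, bw = bws[[i]]), lty = i)
legend("topright", legend = names(bws), lty = seq_along(bws))
```

Which of these is preferable still depends on the data and on the goal, as noted above; the plot is only meant to make the differences visible.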