Solved – Methods to Find the Best Bandwidth for Kernel Density Estimation

kernel-smoothing, machine-learning, python, smoothing

I have one-dimensional data; all points are larger than 0. The median is about 10 and the mean about 25. The distribution looks roughly lognormal, but the frequency around the median is very high and the tail is fat, so a lognormal distribution does not fit well. I am therefore considering kernel density estimation (KDE) to describe the data, and I have tried several ways to find the best bandwidth. (Reference: https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/)

R (built-in bandwidth selectors)

bw.SJ(data)    # Sheather-Jones plug-in
bw.nrd(data)   # Scott's rule of thumb
bw.nrd0(data)  # Silverman's rule of thumb
bw.ucv(data)   # unbiased (least-squares) cross-validation

All results are too small (below 0.2), and the resulting density plot shows too many bumps, which makes it difficult to analyze.
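For comparison across the two languages used here, this is a minimal Python sketch of the same rule-of-thumb selectors (the analogues of bw.nrd and bw.nrd0); the lognormal sample is only a stand-in for the real data:

import numpy as np
from scipy.stats import gaussian_kde

# Stand-in for the real data: positive, median near 10, fat right tail.
rng = np.random.default_rng(0)
data = rng.lognormal(mean=np.log(10), sigma=1.0, size=1000)

# gaussian_kde stores the bandwidth as a multiplicative factor; the
# effective Gaussian bandwidth is factor * sample standard deviation.
for rule in ("scott", "silverman"):
    kde = gaussian_kde(data, bw_method=rule)
    print(rule, kde.factor * data.std(ddof=1))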

Python scikit-learn (cross-validation)

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

grid = GridSearchCV(KernelDensity(), {'bandwidth': np.linspace(0.1, 1.0, 30)}, cv=20)
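A hedged sketch of how the search is then run, assuming data is a 1-D numpy array (scikit-learn's KernelDensity expects a 2-D array):

X = data[:, None]            # reshape to (n_samples, 1) as KernelDensity expects
grid.fit(X)
print(grid.best_params_)     # the selected bandwidth

kde = grid.best_estimator_
xs = np.linspace(0, data.max(), 500)[:, None]
dens = np.exp(kde.score_samples(xs))   # score_samples returns the log density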

I ended up testing bandwidths from 0.01 to 50, and the best one was 20. A bandwidth of 20 is far too large: the resulting density is almost flat and does not fit the data at all.

Do you have any idea why these methods do not work well with my data? Can you suggest other methods for finding a better bandwidth?

Best Answer

  1. Your data may be genuinely multimodal. The bandwidth selectors you mention work well asymptotically, so if your sample is large and does not contain many outliers, the many bumps may reflect real modes rather than artifacts of a too-small bandwidth.
  2. Your data may contain many outliers. This is a big issue for KDE, since bandwidth selection is sensitive to outliers; the fitted KDE can be way off in such cases.
  3. If you believe your data is unimodal, you may want to compare the fit of the KDE with that of a log-concave density estimator. These are implemented in the R packages logcondens and LogConcDEAD.
  4. Try a parametric alternative, such as a log Student-t distribution (a Student t fitted to the log of the data).
  5. Fit your data on the log scale. This may help you visualise some features more clearly and dampen the effect of outliers; a sketch combining points 4 and 5 follows this list.
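A minimal Python sketch of points 4 and 5, assuming data is a 1-D numpy array of positive values (the variable names are illustrative):

import numpy as np
from scipy import stats

log_data = np.log(data)          # all points are > 0, so the log is defined

# Point 4: fit a Student t to log(data), i.e. a log Student-t model.
df, loc, scale = stats.t.fit(log_data)

# Point 5: estimate the density on the log scale, then map back to the
# original scale with the change of variables f_X(x) = f_Y(log x) / x.
kde_log = stats.gaussian_kde(log_data)
xs = np.linspace(data.min(), data.max(), 500)
kde_dens = kde_log(np.log(xs)) / xs
t_dens = stats.t.pdf(np.log(xs), df, loc, scale) / xs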