Solved – How to choose the bandwidth of a KDE in python

cross-validationkernel-smoothing

Python's Sklearn module provides methods to perform Kernel Density Estimation.
One of the challenges in Kernel Density Estimation is the correct choice of the kernel-bandwidth.

I have come across the following python-expression to select a bandwidth:

grid = GridSearchCV(KernelDensity(kernel = 'gaussian'),{'bandwidth': np.linspace(0.1, 0.5, 20)}, cv = 5, iid = True)

Here, GridSearchCV is a method that performs K-Fold Cross-Validation. Here is how I understand it:

We split the data, whose density is to be estimated, into K subsets. We then train the Kernel-Density-Estimation Algorithm with the data points of K-1 subsets. And finally, we evaluate the accuracy of our found parameters on the remaining subset. We repeat the process K times, each time choosing a different subset for testing.

Now, here is my question: In the given python-expression: Which is the algorithm that KernelDensity(kernel = 'gaussian') makes use of?
Is it a nearest-neighbor algorithm? Does this mean: We consider a data point and place a Gaussian onto it. We will now have a look at all the data points that fall within this Gaussian. From their values, we estimate the value of the data point that we have placed the Gaussian on.

Does anything that I am saying make sense?

Best Answer

The grid search CV is sensible because having a PDF estimation with data of $K-1$ folds and testing on the holdout set, by calculating the data log-likelihood, i.e. $\sum \log \hat{p}(x_i)$, you can get an estimate of how good is your KDE.

Normally, for a given query point, KDE uses all the points in the data to estimate its density, i.e. $f_X(x)=\frac{1}{n}\sum K_h(x-x_i)$, when the kernel has infinite support. However, distant points have negligible effect (e.g. adding $10^{-16}$ to $1$ for example), and to be effective/practical some number of neighbours are used, depending on your bandwidth. This can be done efficiently via data structures like ball tree or kd tree as @user11852 also pointed out.