Kernel Smoothing – KDE Bandwidth Estimation in R and Python

kernel-smoothingpythonr

I am trying to estimate the bandwidth parameter of a multivariate KDE in R and then use the estimate to evaluate the KDE in Python.

The reason for this somewhat convoluted approach is that my project is in Python, but, as far as I know, there is no multivariate implementation of a plug-in selector for the bandwidth in Python. So I reverted to estimate the diagonal matrix for the bandwidth with R's ks package. Unfortunately, something does not work as I expected. I boiled it down to the following example:

import pandas as pd
import numpy as np

import rpy2.robjects as ro
from rpy2.robjects.conversion import localconverter
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import statsmodels.api as sm

# ###############################################################################
# # call this part only once to install "ks"
# utils = importr('utils')
# base = importr('base')

# # select a mirror, otherwise the user is prompted
# utils.chooseCRANmirror(ind=1) # select the first mirror in the list

# # install required package "ks"
# utils.install_packages('ks')
# ###############################################################################

def get_bandwidth(data):
    ks = importr("ks")
    with localconverter(ro.default_converter + pandas2ri.converter):

        # convert pandas.DataFrame to R DataFrame
        r_from_pd_df = ro.conversion.py2rpy(data)

        # bandwidth selection with rule of thumb
        ks_hns = ks.Hns(r_from_pd_df)
        ks_hns = ro.conversion.rpy2py(ks_hns)

        # symetric bandwith (diagonal matrix)
        ks_hpi_diag = ks.Hpi_diag(r_from_pd_df)
        ks_hpi_diag = ro.conversion.rpy2py(ks_hpi_diag)

        return ks_hns, ks_hpi_diag

# Create test-data
data_x, data_y = make_blobs(
    n_samples=1000, n_features=2, centers=3, cluster_std=0.5, random_state=0
)

# re-scale one dimension
data_x[:, 0] = data_x[:, 0] * 20

plt.hexbin(data_x[:, 0], data_x[:, 1])
plt.show()

ks_hns, ks_hpi_diag = get_bandwidth(pd.DataFrame(data_x))

dens_n = sm.nonparametric.KDEMultivariate(
    data=data_x, var_type="cc", bw="normal_reference"
)
dens_cw = sm.nonparametric.KDEMultivariate(
    data=data_x, var_type="cc", bw="cv_ml"
    )

print("Python 'statsmodels' normal reference bandwidth estimate:")
print(np.diag(dens_n.bw))
print("R 'ks' normal scale bandwidth:")
print(ks_hns)

print(f"Python 'statsmodels' cross validation bandwidth estimate:")
print(np.diag(dens_cw.bw))
print("R 'ks' PI bandwidth diagonal:")
print(ks_hpi_diag)

This returns:

Python 'statsmodels' normal reference bandwidth:
[[10.66300197  0.        ]
 [ 0.          0.4962383 ]]
R 'ks' normal reference bandwidth:
[[101.29354253  -1.81323816]
 [ -1.81323816   0.21938319]]
Python 'statsmodels' cross validation bandwidth:
[[4.40348543 0.        ]
 [0.         0.19790704]]
R 'ks' PI bandwidth diagonal:
[[20.52921948  0.        ]
 [ 0.          0.04962396]]

I would expect that the results of the two implementation's normal reference (rule of thumb) will give me not exactly the same results, but at least something in the same order of magnitude. The same is true for ks's plug-in method and statsmodels' cross-validation method (ok, here I'm not that sure).

As suggested by @Josef, I plotted the different KDEs (code not shown):
enter image description here
The data for the plots on the top row are produced with statsmodels and its estimate for the bandwidth, the one on the bottom row with ks.

It seems to me from the plots that the different bandwidth produce comparable results with the corresponding implementations. For example, the two plots on the left are similar, even though the bandwidth are significantly different.

How are the estimates of the two so different?

The only idea I have is that they use different parameterizations or scales, but I couldn't find much on that. If that is the case, I would appreciate it if somebody could provide me a hint on where I could find that.

Best Answer

In case somebody is looking for an answert to this question, @whuber was correct in the comments. I just make an answer out of it so it can be found easier.

From observations I can confirm that that R's ks returns the "bandwidth matrix", which is equivalent to the covarniance matrix. Python's statsmodels on the other hand seems to return square root of the diagonal.

So, if you want use your bandwidth estimate from R's ks in Python's statsmodels, you'll just have to take the element-wise square root of the bandwidth.