Solved – Bandwidth estimation in gaussian_kde in scipy


In gaussian_kde from the scipy library, there are two methods to estimate the bandwidth: "scott" and "silverman".

The Silverman rule of thumb is explained here, and the equivalent function in R is provided here.

So my question is: why doesn't the Silverman method in gaussian_kde look the same? You can see the code here. There is nothing about the variance and other stuff; the only parameter the method seems to use is the dimension of the data!
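For reference, here is a minimal sketch of what the linked silverman_factor computes, assuming unweighted data so that the effective sample size is simply the number of points n:

import numpy as np

def silverman_factor(n, d):
    # Scaling factor as in scipy's gaussian_kde.silverman_factor
    # (unweighted data assumed, so neff is simply n).
    return np.power(n * (d + 2.0) / 4.0, -1.0 / (d + 4))

print(silverman_factor(4, 2))  # the factor for the 4-point, 2-D data below

So the factor alone depends only on the sample size and the dimension.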

Last but not least, I want to have a different bandwidth for each dimension; each dimension should have its own bandwidth.

Maybe I am mistaken and this "silverman_factor" is not exactly the Silverman estimate. If so, can someone show me how to calculate the bandwidth using this library or another library in Python?

UPDATE

Before going to the code, I would like to remind myself of Silverman's rule of thumb for bandwidth selection. Basically, it is equal to 1.06*std*n**(-1/5), where std is the standard deviation and n is the number of data points.
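A minimal sketch of that formula in Python, applied separately to each dimension (using the sample standard deviation, ddof=1):

import numpy as np

def silverman_rule(data):
    # Silverman's rule of thumb, 1.06 * std * n**(-1/5), per column.
    # `data` has one row per observation, one column per dimension.
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    return 1.06 * data.std(axis=0, ddof=1) * n ** (-1.0 / 5.0)

x = [1, 9, 4, 10]
y = [11, 3, 5, 7]
print(silverman_rule(np.column_stack([x, y])))  # one bandwidth per dimension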

Here is an R implementation:

library(MASS)
n.grid = 25

x = c(1, 9, 4, 10)
y = c(11, 3, 5, 7)
delta = c(bcv(x), bcv(y)) # biased cross-validation bandwidths: 14.66375 11.78500
kde2d.xy <- kde2d(x, y, n = n.grid, h = delta)
FXY <- kde2d.xy$z + .Machine$double.eps # avoid log(0)
dx <- kde2d.xy$x[2] - kde2d.xy$x[1] # grid spacing in x
dy <- kde2d.xy$y[2] - kde2d.xy$y[1] # grid spacing in y
PXY <- FXY/(sum(FXY)*dx*dy) # normalize so the density integrates to 1
PX <- rowSums(PXY)*dy # marginal of x
PY <- colSums(PXY)*dx # marginal of y
HXY <- -sum(PXY * log(PXY))*dx*dy # entropy

And here is the equivalent Python code:

import numpy as np
import scipy.stats as stats
from scipy.stats import pearsonr


def colSum(x):
    return x.sum(axis=0)

def rowSum(x):
    return x.sum(axis=1)

a = (1, 9, 4, 10)
b = (11, 3, 5, 7)
n = 25
tt = np.array([a, b])
rvs = tt.T

# gaussian_kde expects one row per dimension, hence the transpose back
kde = stats.gaussian_kde(rvs.T, bw_method='silverman')
#~ kde = stats.gaussian_kde(rvs.T)
#~ kde.set_bandwidth(11)
kde.silverman_factor()  # scalar factor; the variance enters elsewhere

x_flat = np.r_[rvs[:, 0].min():rvs[:, 0].max():25j]
y_flat = np.r_[rvs[:, 1].min():rvs[:, 1].max():25j]

# Regular grid to evaluate kde upon
x, y = np.meshgrid(x_flat, y_flat)

grid_coords = np.append(x.reshape(-1, 1), y.reshape(-1, 1), axis=1)
z = kde(grid_coords.T)
z = z.reshape(n, n)

rho = pearsonr(a, b)[0]  # correlation of the two variables (unused below)

FXY = z + np.finfo(float).eps  # avoid log(0)
dx = x[0][1] - x[0][0]  # grid spacing in x
dy = y[1][0] - y[0][0]  # grid spacing in y
PXY = FXY / (FXY.sum() * dx * dy)  # normalize so the density integrates to 1
PX = rowSum(PXY) * dy  # marginal of x
PY = colSum(PXY) * dx  # marginal of y
HXY = -np.sum(PXY * np.log(PXY)) * dx * dy  # entropy

print('X : ' + str(a))
print('Y : ' + str(b))
print('bw : ' + str(kde.covariance_factor()))
print(kde.covariance)
print('entropy: ' + str(HXY))

         bw (x)   bw (y)   entropy
Python   11.34    7.39     0.13
R        14.66    11.78    4.32

I should also add, following what @rroowwllaanndd said: the actual Silverman bandwidth is computed in the _compute_covariance function, and the per-dimension bandwidth values are the diagonal of the kde_object.covariance matrix, even though kde_object.silverman_factor() does not return these diagonals.
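To make this concrete, here is a minimal check, reusing the kde and rvs objects from the Python code above:

import numpy as np

# kde.covariance is the sample covariance of the data scaled by the
# squared factor, so its diagonal holds the per-dimension bandwidth
# values (on the variance scale).
factor = kde.silverman_factor()  # scalar factor
data_cov = np.cov(rvs.T)         # sample covariance, one row per dimension
print(np.allclose(kde.covariance, data_cov * factor**2))  # True
print(np.diag(kde.covariance))   # per-dimension bandwidth values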

It is also very confusing that when I feed the R code with the bandwidth values from Python, the entropy is still not comparable to Python's value.

library(MASS)
n.grid = 25

x = c(1, 9, 4, 10)
y = c(11, 3, 5, 7)
# delta = c(bcv(x), bcv(y))
delta = c(11.34, 7.39) # scipy bandwidth estimations from python code
kde2d.xy <- kde2d(x, y, n = n.grid, h = delta)
FXY <- kde2d.xy$z + .Machine$double.eps
dx <- kde2d.xy$x[2] - kde2d.xy$x[1]
dy <- kde2d.xy$y[2] - kde2d.xy$y[1]
PXY <- FXY/(sum(FXY)*dx*dy)
PX <- rowSums(PXY)*dy
PY <- colSums(PXY)*dx
HXY <- -sum(PXY * log(PXY))*dx*dy # entropy
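One plausible source of the remaining mismatch, assuming MASS's documented convention: kde2d internally divides h by 4 and uses the result as the Gaussian kernel's standard deviation, while the SciPy values above are the diagonal of a covariance matrix, i.e. variances. A hypothetical like-for-like conversion, reusing the kde object from the Python code above, would be:

import numpy as np

# Assumption: MASS::kde2d uses h/4 as the kernel standard deviation,
# whereas the diagonal of kde.covariance is on the variance scale.
sigma = np.sqrt(np.diag(kde.covariance))  # per-dimension kernel sd
h_for_kde2d = 4 * sigma                   # candidate value for kde2d's h
print(h_for_kde2d)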

Best Answer

Actually, the Scipy implementation of the bandwidth estimate does depend on the variance of each data dimension.

In the original code that you've linked to, the _compute_covariance method sets the covariance matrix for the Gaussian kernel to the data covariance multiplied by the square of the factor returned by either scotts_factor or silverman_factor.

So, in summary, it looks like what's being calculated is equivalent to the original formula, but the calculation is decomposed so that the scotts_factor and silverman_factor functions capture only the portion that is unique to each method - which explains why it looks different from the R code.
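As a sanity check of that reading (a sketch, not scipy's internals verbatim): in one dimension, scipy's factor is (3n/4)**(-1/5), which is approximately 1.06*n**(-1/5), so multiplying it by the sample standard deviation recovers the classic Silverman rule from the question:

import numpy as np
from scipy import stats

data = np.array([1.0, 9.0, 4.0, 10.0])  # a 1-D sample
n = data.size

kde = stats.gaussian_kde(data, bw_method='silverman')
bandwidth = kde.silverman_factor() * data.std(ddof=1)  # factor * std

# Silverman's classic rule of thumb for comparison:
rule_of_thumb = 1.06 * data.std(ddof=1) * n ** (-1.0 / 5.0)

print(bandwidth, rule_of_thumb)  # nearly identical

In higher dimensions, the same scalar factor multiplies the whole data covariance (squared, since it scales a covariance rather than a standard deviation), which is why each dimension ends up with its own bandwidth even though the factor itself ignores the variance.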