Solved – Why is the area under this KDE with bandwidth=1 only 0.5

kernel-smoothing, scikit-learn

I am misunderstanding KDE. I thought the area under the curve was always unity, but in the simple example below, with bandwidth 1 the area under the curve is 0.5, and with bandwidth 0.5 the area is 1. Can someone please explain?

import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KernelDensity

day = np.array([0, 1, 2, 3, 4, 5, 6])
clf = KernelDensity(bandwidth=1, kernel="tophat")
clf.fit(day.reshape(-1, 1))
# evaluate the estimated log-density on a 7-point grid spanning the data
r = np.linspace(day.min(), day.max(), 7).reshape(-1, 1)
plt.plot(range(7), np.exp(clf.score_samples(r)))
plt.show()

EDIT: added imports

Best Answer

While your basic understanding of the KDE interpretation is correct, I think you are not taking into account the computational aspects around KDE. The phenomenon you experience has two main causes. First, you are using an extremely coarse grid for your range r, so any numerical approximation of the area will exhibit discretisation artefacts. Second, you force r to span exactly your sample range, so you experience strong edge effects at the boundary of your estimated support. Edge effects are the phenomenon of having insufficient information near the edges of the support, which leads the kernel estimates of your smoother to massively over-estimate the significance of the data (or, as here, of their absence). As you can see below, this issue is not strongly related to the choice of the particular bandwidth:

    import numpy as np
    import scipy.integrate as integrate
    from sklearn.neighbors import KernelDensity

    day = np.array([0, 1, 2, 3, 4, 5, 6, 7]).reshape(-1, 1)
    clf = KernelDensity(bandwidth=1, kernel="tophat").fit(day)

    # Your original setting
    padding = 0
    numOfPoints = 7
    r = np.linspace(day.min() - padding, day.max() + padding, numOfPoints)
    integrate.trapezoid(np.exp(clf.score_samples(r.reshape(-1, 1))), r)
    # 0.8020... # Pretty bad indeed

    # More points for 'r' but still strong edge effects
    padding = 0
    numOfPoints = 7777
    r = np.linspace(day.min() - padding, day.max() + padding, numOfPoints)
    integrate.trapezoid(np.exp(clf.score_samples(r.reshape(-1, 1))), r)
    # 0.8749... # Still quite bad

    # Weaker edge effects but too coarse 'r'
    padding = 3
    numOfPoints = 7
    r = np.linspace(day.min() - padding, day.max() + padding, numOfPoints)
    integrate.trapezoid(np.exp(clf.score_samples(r.reshape(-1, 1))), r)
    # 1.0833... # Getting there.

    # Weaker edge effects and more points for 'r'
    padding = 3
    numOfPoints = 7777
    r = np.linspace(day.min() - padding, day.max() + padding, numOfPoints)
    integrate.trapezoid(np.exp(clf.score_samples(r.reshape(-1, 1))), r)
    # 0.9999... # Adequate.
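
As a further sketch (assuming, per the scikit-learn documentation, that the tophat kernel is uniform within one bandwidth of each point): for such a compact-support kernel, a padding equal to the bandwidth already covers the full support of the estimate, so only the grid resolution limits the accuracy of the integral.

    # Sketch: padding by exactly one bandwidth covers the tophat estimate's
    # full support [day.min() - 1, day.max() + 1], so with a fine grid the
    # trapezoidal integral recovers 1.
    padding = clf.bandwidth  # = 1, the tophat kernel's support radius
    numOfPoints = 7777
    r = np.linspace(day.min() - padding, day.max() + padding, numOfPoints)
    integrate.trapezoid(np.exp(clf.score_samples(r.reshape(-1, 1))), r)
    # ≈ 1.0 (up to trapezoidal error at the density's jump points)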

A side comment: because you are using a kernel with finite support (i.e. hard cut-offs), such as the uniform/rectangular/tophat kernel (really, so many different terms for the same simple thing...), the final kernel density estimates are even more sub-optimal given the small sample size (you have only 7 points, after all). If one has sparse data, as here, a Gaussian kernel should be preferable, at least as a first choice, as it allows remote points to weakly inform the final KDE; see the sketch below.
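
A minimal sketch of that suggestion (the clf_gauss name is illustrative): the same integral check with a Gaussian kernel, padding by several bandwidths since the Gaussian has infinite support.

    # Sketch: the Gaussian kernel's support is infinite, so we pad by a few
    # bandwidths to capture essentially all of the tail mass.
    clf_gauss = KernelDensity(bandwidth=1, kernel="gaussian").fit(day)
    padding = 5  # ~5 bandwidths; tail mass beyond this is negligible
    numOfPoints = 7777
    r = np.linspace(day.min() - padding, day.max() + padding, numOfPoints)
    integrate.trapezoid(np.exp(clf_gauss.score_samples(r.reshape(-1, 1))), r)
    # ≈ 1.0, and every data point now weakly informs the whole estimate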
