Solved – Why is the area under this KDE with bandwidth=1 only 0.5

kernel-smoothing, scikit-learn

I am misunderstanding KDE. I thought the area under the curve was always unity, but in the simple example below, with bandwidth 1 the area under the curve is 0.5, and with bandwidth 0.5 the area is 1. Can someone please explain?

import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KernelDensity

day = np.array([0, 1, 2, 3, 4, 5, 6])
clf = KernelDensity(bandwidth=1, kernel="tophat")
clf.fit(day.reshape(-1, 1))
# evaluate the estimated log-density on a 7-point grid spanning the data
r = np.linspace(day.min(), day.max(), 7).reshape(-1, 1)
plt.plot(range(7), np.exp(clf.score_samples(r)))
plt.show()

EDIT: added imports

Best Answer

While your basic understanding of the KDE interpretation is correct, I think you are not taking into account the computational aspects around KDE. The phenomenon you experience has two main causes. First, you are using an extremely coarse grid for your range r, so any numerical approximation of the area will exhibit discretisation artefacts. Second, you force r to span exactly your sample range, so you experience strong edge effects at the boundary of your estimated support. Edge effects are the phenomenon of having insufficient information near the edges of the support, which leads the kernel estimates of your smoother to massively over-estimate the significance of the data (or, as here, of their absence). As you can see below, this issue is not strongly related to the choice of the particular bandwidth:

    import numpy as np
    import scipy.integrate as integrate
    from sklearn.neighbors import KernelDensity

    day = np.array([0, 1, 2, 3, 4, 5, 6, 7]).reshape(-1, 1)
    clf = KernelDensity(bandwidth=1, kernel="tophat").fit(day)

    # Your original setting
    padding = 0
    numOfPoints = 7
    r = np.linspace(day.min() - padding, day.max() + padding, numOfPoints)
    integrate.trapezoid(np.exp(clf.score_samples(r.reshape(-1, 1))), r)
    # 0.8020... # Pretty bad indeed

    # More points for 'r' but still strong edge effects
    padding = 0
    numOfPoints = 7777
    r = np.linspace(day.min() - padding, day.max() + padding, numOfPoints)
    integrate.trapezoid(np.exp(clf.score_samples(r.reshape(-1, 1))), r)
    # 0.8749... # Still quite bad

    # Weaker edge effects but too coarse 'r'
    padding = 3
    numOfPoints = 7
    r = np.linspace(day.min() - padding, day.max() + padding, numOfPoints)
    integrate.trapezoid(np.exp(clf.score_samples(r.reshape(-1, 1))), r)
    # 1.0833... # Getting there.

    # Weaker edge effects and more points for 'r'
    padding = 3
    numOfPoints = 7777
    r = np.linspace(day.min() - padding, day.max() + padding, numOfPoints)
    integrate.trapezoid(np.exp(clf.score_samples(r.reshape(-1, 1))), r)
    # 0.9999... # Adequate.
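
As a further sketch (assuming, per the scikit-learn documentation, that the tophat kernel is uniform within one bandwidth of each point): for such a compact-support kernel, a padding equal to the bandwidth already covers the full support of the estimate, so only the grid resolution limits the accuracy of the integral.

    # Sketch: padding by exactly one bandwidth covers the tophat estimate's
    # full support [day.min() - 1, day.max() + 1], so with a fine grid the
    # trapezoidal integral recovers 1.
    padding = clf.bandwidth  # = 1, the tophat kernel's support radius
    numOfPoints = 7777
    r = np.linspace(day.min() - padding, day.max() + padding, numOfPoints)
    integrate.trapezoid(np.exp(clf.score_samples(r.reshape(-1, 1))), r)
    # ≈ 1.0 (up to trapezoidal error at the density's jump points)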

A side comment: because you are using a kernel with finite support (i.e. hard cut-offs), such as the uniform/rectangular/tophat kernel (really, so many different terms for the same simple thing...), the final kernel density estimates are even more sub-optimal given the small sample size (you have only 7 points, after all). If one has sparse data, as here, a Gaussian kernel should be preferable, at least as a first choice, as it allows remote points to weakly inform the final KDE; see the sketch below.
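
A minimal sketch of that suggestion (the clf_gauss name is illustrative): the same integral check with a Gaussian kernel, padding by several bandwidths since the Gaussian has infinite support.

    # Sketch: the Gaussian kernel's support is infinite, so we pad by a few
    # bandwidths to capture essentially all of the tail mass.
    clf_gauss = KernelDensity(bandwidth=1, kernel="gaussian").fit(day)
    padding = 5  # ~5 bandwidths; tail mass beyond this is negligible
    numOfPoints = 7777
    r = np.linspace(day.min() - padding, day.max() + padding, numOfPoints)
    integrate.trapezoid(np.exp(clf_gauss.score_samples(r.reshape(-1, 1))), r)
    # ≈ 1.0, and every data point now weakly informs the whole estimate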
