I am misunderstanding KDE. I thought the area under the curve was always unity. Take this simple example: with bandwidth 1 the area under the curve is 0.5, but if I make the bandwidth 0.5 the area is 1. Can someone please explain?
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KernelDensity

day = np.array([0, 1, 2, 3, 4, 5, 6])
clf = KernelDensity(bandwidth=1, kernel="tophat")
clf.fit(day.reshape(-1, 1))
# evaluate the estimate on a 7-point grid spanning exactly the data range
r = np.linspace(day.min(), day.max(), 7).reshape(-1, 1)
plt.plot(range(7), np.exp(clf.score_samples(r)))
plt.show()
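For reference, this is how I am measuring the area (the grid spacing is 1, so the Riemann sum is just the sum of the sampled densities):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

day = np.array([0, 1, 2, 3, 4, 5, 6])
r = np.linspace(day.min(), day.max(), 7).reshape(-1, 1)

for bw in (1.0, 0.5):
    clf = KernelDensity(bandwidth=bw, kernel="tophat").fit(day.reshape(-1, 1))
    area = np.exp(clf.score_samples(r)).sum()  # grid spacing is 1
    print(bw, area)  # bandwidth 1 -> ~0.5, bandwidth 0.5 -> ~1.0
```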
EDIT: added imports
Best Answer
While your basic understanding of the KDE interpretation is correct, I think you are not taking into account the computational aspects around KDE. The phenomenon you experience has two main causes. First, you are using an extremely coarse grid for your range `r`, so any numerical approximation of the area will exhibit discretisation artefacts. Second, you force `r` to span exactly your sample, so you experience strong edge-effects in your estimated support. Edge-effects are the phenomenon of having insufficient information near the edges of the support, which leads the kernel estimates of your smoother to massively over-estimate the significance of the data (or, as here, of their absence); any kernel mass that falls outside your grid is simply never counted towards the area. As you can see, this issue is not strongly related to the choice of the particular bandwidth. See for example below:

A side-comment: because you are using a kernel with finite support (i.e. hard cut-offs), variously called `uniform`/`rectangular`/`tophat` (really, so many different terms to describe the same simple thing...), the final kernel density estimates are even more sub-optimal given the small original sample size (you do have only 7 points, after all). With sparse data like this, using a Gaussian kernel should, at first instance at least, be preferable, as it allows remote points to weakly inform the final KDE.
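To make both points concrete, here is a sketch (numerical integration with `np.trapz` is my choice here) that uses a fine grid padded well past the data, so no kernel mass is cut off; the estimate then integrates to roughly 1 for either bandwidth:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

day = np.array([0, 1, 2, 3, 4, 5, 6])

for bw in (0.5, 1.0):
    kde = KernelDensity(bandwidth=bw, kernel="tophat").fit(day.reshape(-1, 1))
    # Fine grid, padded past the data so no kernel mass falls outside it.
    r = np.linspace(day.min() - 4 * bw, day.max() + 4 * bw, 4001).reshape(-1, 1)
    dens = np.exp(kde.score_samples(r))
    print(f"bandwidth={bw}: area = {np.trapz(dens, r.ravel()):.3f}")  # ~1 for both
```

The same check with `kernel="gaussian"` also integrates to roughly 1 on a padded grid; the practical difference is that the Gaussian lets distant points contribute a little everywhere, which matters with only 7 observations.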