Solved – Adaptive kernel density estimators

Tags: k-nearest-neighbour, kde

Can anyone report on their experience with an adaptive kernel density estimator?
(There are many synonyms:
adaptive | variable | variable-width, KDE | histogram | interpolator …)

The Wikipedia article Variable kernel density estimation
says "we vary the width of the kernel in different regions of the sample space.
There are two methods …"
actually, there are more:
neighbors within some radius,
k nearest neighbors (k usually fixed; see the sketch below),
k-d trees, multigrid …
Of course no single method can do everything,
but adaptive methods look attractive.
See for example the nice picture of an adaptive 2d mesh in
Finite element method.
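
For concreteness, here is a rough sketch (mine, assuming SciPy's cKDTree; not a polished implementation) of the k-nearest-neighbor "balloon" estimate, where the bandwidth at each query point is the distance to its k-th nearest sample:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(samples, queries, k=10):
    """k-NN density estimate in d dimensions:
    f(x) ~= k / (n * volume of the ball out to the k-th neighbor)."""
    n, d = samples.shape
    tree = cKDTree(samples)                     # fast to build/query in 2d, 3d
    dist, _ = tree.query(queries, k=k)          # (m, k) sorted distances
    r = dist[:, -1]                             # radius to the k-th neighbor
    ball_volume = np.pi ** (d / 2) / gamma(d / 2 + 1) * r ** d
    return k / (n * ball_volume)
```

On 100k points in 2d or 3d the tree build and queries are quick; the estimate itself is the classical k-NN density, which is rough in the tails but tracks sharp local structure.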

I'd like to hear what worked / what didn't work for real data,
especially >= 100k scattered data points in 2d or 3d.

Added 2 Nov: here's a plot of a "clumpy" density (piecewise x^2 * y^2),
a nearest-neighbor estimate, and a Gaussian KDE with Scott's factor.
While one (1) example doesn't prove anything,
it does show that NN can fit sharp hills reasonably well
(and, using KD trees, is fast in 2d, 3d …)
[plot: true "clumpy" density vs. nearest-neighbor estimate vs. Gaussian KDE with Scott's factor]
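
For reference, the fixed-width baseline in the plot can be reproduced along these lines with scipy.stats.gaussian_kde, which uses Scott's factor n**(-1/(d+4)) by default (a sketch; the synthetic data here is just a stand-in):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
pts = rng.normal(size=(100_000, 2))      # stand-in for 100k scattered 2d points
kde = gaussian_kde(pts.T)                # expects shape (d, n); Scott's factor
grid = np.mgrid[-3:3:50j, -3:3:50j]      # 50 x 50 evaluation grid
dens = kde(grid.reshape(2, -1)).reshape(50, 50)
# evaluation is O(n * m), so expect a noticeable wait at this size
```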

Best Answer

The article *G. R. Terrell; D. W. Scott (1992). "Variable kernel density estimation". Annals of Statistics 20: 1236–1265*, cited at the end of the Wikipedia article you yourself cite, states clearly that unless the observation space is very sparse, the variable kernel method is not recommended for Gaussian-distributed random variables, on the basis of both local and global root mean squared error. Through theoretical arguments they cite the figure of $n \leq 450$ ($n$ is the sample size), and through bootstrapping results $p \geq 4$ ($p$ is the number of dimensions), as the settings in which the variable kernel method becomes competitive with fixed-width ones (judging from your question, you are not in either of these settings).

The intuition behind these results is that if you are not in a very sparse setting, the local density simply does not vary enough for the gain in bias to outweigh the loss in efficiency (and hence the AMISE of the variable-width kernel increases relative to the AMISE of the fixed-width one). Also, given the large sample size you have (and the low dimension), the fixed-width kernel will already be very local, diminishing any potential gains in terms of bias.
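
To make that concrete, here is a small numerical illustration (my own sketch, not from Terrell & Scott): on 1d Gaussian data with $n$ well above 450, a fixed-width KDE will typically beat a naive k-NN balloon estimator in integrated squared error:

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(1)
x = rng.normal(size=2000)              # n well above ~450, p = 1
grid = np.linspace(-4, 4, 801)

fixed = gaussian_kde(x)(grid)          # fixed width, Scott's factor

# naive k-NN "balloon" estimate: the bandwidth at each grid point is the
# distance r to the k-th nearest sample, so f(x) ~= k / (n * 2r) in 1d
k = 20
r = np.sort(np.abs(np.subtract.outer(grid, x)), axis=1)[:, k - 1]
balloon = k / (2 * len(x) * r)

truth = norm.pdf(grid)
dx = grid[1] - grid[0]
print("ISE fixed  :", ((fixed - truth) ** 2).sum() * dx)
print("ISE balloon:", ((balloon - truth) ** 2).sum() * dx)
```

On runs like this the fixed-width estimator usually comes out ahead, matching the paper's message for non-sparse, low-dimensional settings.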