Kernel Smoothing – Efficient Evaluation of Multidimensional Kernel Density Estimate


I've seen a reasonable amount of literature about how to choose kernels and bandwidths when computing a kernel density estimate, but I am currently interested in how to reduce the time it takes to evaluate the resulting KDE at an arbitrary number of points.

In my case, I'm using a multidimensional (2D or 3D) Gaussian kernel with diagonal covariance (i.e. each dimension is independent). The bandwidths in each dimension may differ and are selected using nearest neighbours. However, my question probably extends to different kernels and bandwidth selection methods.

Let's say I have $M$ data points and wish to evaluate the resulting KDE at $N$ grid points. A simple implementation involves evaluating the multivariate normal pdf $MN$ times. For my purposes, $M$ and $N$ are both on the order of thousands, and evaluation has become the bottleneck in my code.
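In case it's useful, here is a minimal NumPy sketch of the naive evaluation I mean (the function name and the array layout are just my own choices, not from any library):

```python
import numpy as np

def naive_kde(data, bandwidths, grid):
    """Naive O(M*N) evaluation of a diagonal-covariance Gaussian KDE.

    data:        (M, d) array of sample points
    bandwidths:  (M, d) array of per-point, per-dimension bandwidths
                 (e.g. from a nearest-neighbour rule)
    grid:        (N, d) array of evaluation points
    Returns an (N,) array of density estimates.
    """
    M, d = data.shape
    # Standardised distances for every (grid point, data point) pair: (N, M, d)
    z = (grid[:, None, :] - data[None, :, :]) / bandwidths[None, :, :]
    # Normalising constant of each data point's Gaussian kernel: (M,)
    norm = np.prod(bandwidths, axis=1) * (2 * np.pi) ** (d / 2)
    # Product kernel across dimensions, then average over the M data points
    return (np.exp(-0.5 * np.sum(z**2, axis=2)) / norm).mean(axis=1)
```

(The $(N, M, d)$ intermediate array also makes this memory-hungry, which is part of the problem.)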

I am not aware of any generally accepted improvements to this basic method. I found this paper, which claims to reduce the complexity from $O(MN)$ to $O(M+N)$. However, the method has not been implemented in any 'standard' R or Python libraries that I'm aware of, which suggests it hasn't yet been widely adopted.

Thanks for any pointers you can give.

Best Answer

I'm going to provide an (incomplete) answer here in case it helps anyone else out.

  1. There are several recent mathematical methods for computing the KDE more efficiently. One is the Fast Gauss Transform, published in several studies including this one. Another is to use a tree-based approach (KD tree or ball tree) to work out which sources contribute to a given grid point. I'm not sure whether this approach has been published, but it is implemented in Scikit-learn, based on methods developed by Jake Vanderplas (a usage sketch follows this list).
  2. If these methods are a bit fiddly to implement, it's possible to write something more basic that achieves a similar effect. I tried constructing a cuboid around each grid point, with side lengths proportional to the bandwidth in each dimension, and only evaluating the kernel for data points inside it. This doesn't allow great control of errors, though it does give you some speed-up (sketched after this list).
  3. Finally, computing the KDE is quite easily parallelisable, either on multiple CPU cores or on a GPU. I'm considering implementing a KDE in CUDA, but haven't done that yet (a CPU version is sketched below).
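To expand on point 1, here is a sketch of the tree-based route via Scikit-learn's `KernelDensity`. Note that `KernelDensity` takes a single scalar bandwidth, so per-dimension bandwidths have to be emulated by rescaling the data first (the numbers below are purely illustrative); per-point adaptive bandwidths are not supported at all.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 3))           # M data points in 3D
grid = rng.uniform(-3, 3, size=(2000, 3))   # N evaluation points
h = np.array([0.3, 0.5, 0.4])               # illustrative per-dimension bandwidths

# KernelDensity only accepts a scalar bandwidth, so rescale each dimension
# by its bandwidth and use bandwidth=1 in the rescaled space.
kde = KernelDensity(kernel='gaussian', bandwidth=1.0,
                    algorithm='kd_tree', rtol=1e-4)  # rtol trades accuracy for speed
kde.fit(data / h)

# score_samples returns the log-density in the rescaled space; divide by
# the Jacobian (product of the bandwidths) to get back to the data scale.
density = np.exp(kde.score_samples(grid / h)) / np.prod(h)
```

The `rtol`/`atol` tolerances are what let the tree prune far-away sources, so setting them to something non-zero is where the speed-up comes from.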
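For point 2, here is a sketch of the cuboid truncation using a KD tree to find the neighbours. This version assumes a fixed per-dimension bandwidth rather than the adaptive one I actually used, and `r` (the cuboid half-width in bandwidth units) is a knob you have to pick yourself:

```python
import numpy as np
from scipy.spatial import cKDTree

def truncated_kde(data, h, grid, r=4.0):
    """Approximate KDE that ignores data points lying more than r
    bandwidths from a grid point in any dimension (cuboid truncation).

    data: (M, d) samples; h: (d,) fixed bandwidths; grid: (N, d) points.
    """
    M, d = data.shape
    scaled_data = data / h                  # work in bandwidth units
    scaled_grid = grid / h
    tree = cKDTree(scaled_data)
    norm = M * np.prod(h) * (2 * np.pi) ** (d / 2)
    out = np.zeros(len(grid))
    # A Chebyshev (p=inf) ball of radius r is exactly the cuboid of
    # half-width r bandwidths around each grid point.
    neighbours = tree.query_ball_point(scaled_grid, r, p=np.inf)
    for i, idx in enumerate(neighbours):
        z = scaled_grid[i] - scaled_data[idx]
        out[i] = np.exp(-0.5 * np.sum(z**2, axis=1)).sum() / norm
    return out
```

With a Gaussian kernel the mass beyond $r = 4$ bandwidths is tiny, which is where what rough error control there is comes from; with adaptive per-point bandwidths the bookkeeping gets messier.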
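And for point 3, the CPU case is almost trivial because each grid point is independent. A sketch using joblib that splits the grid into chunks, reusing the `naive_kde` function from the question (it needs to be defined at module level so joblib's worker processes can import it):

```python
import numpy as np
from joblib import Parallel, delayed

def parallel_kde(data, bandwidths, grid, n_jobs=-1, n_chunks=32):
    """Evaluate chunks of the grid on separate cores; every grid point
    is independent, so the partial results simply concatenate."""
    chunks = np.array_split(grid, n_chunks)
    parts = Parallel(n_jobs=n_jobs)(
        delayed(naive_kde)(data, bandwidths, chunk) for chunk in chunks
    )
    return np.concatenate(parts)
```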