Solved – Bandwidth parameters in multivariate KDE using scipy.stats.gaussian_kde

density-estimationkernel-smoothingnonparametric-densitypythonscipy

I am working on a project which involves implementing in Python two different density estimation functions over multivariate data; one using N-d histograms and the other using kernel density estimation (KDE).

I used scipy.stats.gaussian_kde, and realised that as the dimensions differed in variance in my underlying data, the KDE function was less able to accurately estimate the underlying probability density.

It seems that we need to have a different bandwidth parameter for each dimension (manifest, I gather, as a bandwidth matrix). scipy.stats.gaussian_kde does not appear to support bandwidth matrix arguments, so my questions are:

  1. Am I right in thinking that this function is flawed/useless without performing some normalisation-type operation on the data to ensure SD is somehow the same for each dimension? If so, it seems like a glaring problem in the implementation.

  2. What is the simplest Python alternative to plug into my code, to calculate KDE given a bandwidth matrix? (Ideally with a similar interface to avoid much refactoring)

Best Answer

Multivariate kernel density estimation can be defined in terms of

  1. product of univariate kernels $K^P(\mathrm{x}) = \prod_{i=1}^{d} \kappa(x_i)$,
  2. as a symmetric kernel $K^S(\mathrm{x}) \propto\kappa\{(\mathrm{x}'\mathrm{x})^{1/2}\}$,
  3. or in terms of standalone multivariate kernel, e.g. multivariate Gaussian distribution.

There are also different possible choices of bandwidth matrix,

  1. it can have equal bandwidth for each of the variables $\mathrm{H} = h^2\mathrm{I}_d$,
  2. different for different variables $\mathrm{H} = \mathrm{diag}(h_1^2, h_2^2, \dots, h_d^2)$,
  3. or it could be a covariance matrix.

The three choices are illustrated by Wand and Jones in their Kernel Smoothing book using the following figure of two-dimensional case.

enter image description here

The first choice is "symmetric", it assumes no correlation and equal variances. Second allows for unequal variances. Third allows additionally for correlation between variables.

The scipy documentation does not tell us much about the kind of multivariate kernel that they are using. It only tells us that it uses Scotts or Silvermans rules of thumb for selecting the bandwidth, so it estimates some constant $h^2$ and the covariance matrix is either same for all variables or is a scaling factor for covariance matrix (more likely, but you'd need to check the source code). Nonetheless, scipy is using rule of a thumb for choosing the bandwidth, so this does not have to be optimal choice and I'd encourage you to look for packages that implement more sophisticted approaches (R has several, I can't tell for python).

Related Question