Multivariate Analysis – Computationally Efficient Estimation of Multivariate Mode

continuous dataestimationmodemultivariate analysis

Short version: What's the most computationally efficient method of estimating the mode of a multidimensional data set, sampled from a continuous distribution?

Long version: I've got a data set that I need to estimate the mode of. The mode does not coincide with the mean or median. A sample is shown below, this is a 2D example, but an N-D solution would be better:
enter image description here

Currently, my method is

  1. Calculate kernel density estimate on a grid equal to the desired
    resolution of the mode
  2. Look for the greatest calculated point

Obviously, this calculates the KDE at a lot of non-plausible points, which is especially bad if there are a lot of data points of high dimensions or I expect good resolution on the mode.

An alternate would be to use a simulated annealing, genetic algorithm, etc to find the global peak in the KDE.

The question is whether there's a smarter method of performing this calculation?

Best Answer

The method that would fit the bill for what you want to do is the mean-shift algorithm. Essentially, mean-shift relies on moving along the direction of the gradient, which is estimated non-parametrically with the "shadow", $K'$ of a given kernel $K$. To wit, if the density $f(x)$ is estimated by $K$, then $\nabla f(x)$ is estimated by $K'$. Details of estimating the gradient of a kernel density are described in Fukunaga and Hostetler (1975), which also happened to introduce the mean-shift algorithm.

A very detailed exposition on the algorithm is also given in this blog entry.

REFERENCES:

  • K. Fukunaga and L. Hostetler, "The estimation of the gradient of a density function, with applications in pattern recognition, " IEEE Transactions on Information Theory 21(1), January 1975.
Related Question