Solved – Mahalanobis distance measure for clustering

clustering, distance, distance-functions, distributions

Let's say I have a group of clusters. Would you recommend the Mahalanobis distance measure for checking whether newly arrived data belongs to an existing cluster or is an outlier?

Also, would you recommend this distance measure during clustering itself, and in which cases?

Thanks

Best Answer

This is a multivariate Gaussian:

$$ f(x;\mu,\Sigma) = \frac{1}{\sqrt{(2\pi)^{n}|\Sigma|}}e^{(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu))} $$

Mahalanobis distance is related to the power of the exponential: $$MD = \sqrt{(x-\mu)^T\Sigma^{-1}(x-\mu)}$$

So I would say that if your underlying distributions are multivariate Gaussians, the Mahalanobis distance is a natural choice. The major practical problem is estimating the precision matrix $\Sigma^{-1}$ in high-dimensional settings with few observations, where the sample covariance is ill-conditioned or singular.
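As a minimal sketch of the formula above (assuming NumPy, with a synthetic cluster sample standing in for your data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))        # 500 points from one 3-D cluster
mu = X.mean(axis=0)                  # estimated cluster mean
Sigma = np.cov(X, rowvar=False)      # estimated covariance
Sigma_inv = np.linalg.inv(Sigma)     # precision matrix; this inversion is the
                                     # fragile step when observations are few

def mahalanobis(x, mu, Sigma_inv):
    """MD = sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    d = x - mu
    return float(np.sqrt(d @ Sigma_inv @ d))

md = mahalanobis(np.array([3.0, 0.0, 0.0]), mu, Sigma_inv)
```

SciPy also ships `scipy.spatial.distance.mahalanobis`, which takes the same precision matrix; the point is that you must estimate $\Sigma^{-1}$ reliably before either is trustworthy.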

If you have no choice but to perform automated outlier detection, then MD has some nice interpretable qualities. In the univariate case, this normalized distance is simply the number of standard deviations from the mean. If your data is indeed normal, it is common to label points more than $n$ standard deviations from the mean as outliers ($MD > n$). Choosing $n = 2$ as the threshold rejects points beyond roughly the 95th percentile of the underlying distribution in the univariate case.
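The univariate equivalence is easy to verify numerically: with a $1 \times 1$ covariance matrix, MD reduces to $|x - \mu|/\sigma$ (a small sketch with made-up numbers):

```python
import numpy as np

x = np.array([4.0])
mu = np.array([0.0])
sigma = 2.0
Sigma_inv = np.array([[1.0 / sigma**2]])  # 1x1 precision matrix

d = x - mu
md = float(np.sqrt(d @ Sigma_inv @ d))    # equals |x - mu| / sigma = 2.0,
                                          # i.e. "2 standard deviations out"
```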

In the multivariate case, the curse of dimensionality comes into play. If you wanted to keep the 95th-percentile rule, you would need to set the cutoff at the corresponding quantile of a $\chi$ distribution with $d$ degrees of freedom, where $d$ is the data dimension (equivalently, $MD^2$ follows a $\chi^2_d$ distribution when the data is Gaussian).
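A sketch of that cutoff, assuming SciPy is available: take the 95th-percentile quantile of $\chi^2_d$ and square-root it to get the MD threshold. In one dimension this recovers the familiar $\approx 1.96\sigma$ rule.

```python
import numpy as np
from scipy.stats import chi2

def md_threshold(d, keep=0.95):
    """MD cutoff so that a fraction `keep` of Gaussian data is retained:
    MD^2 ~ chi-squared with d degrees of freedom."""
    return float(np.sqrt(chi2.ppf(keep, df=d)))

t1 = md_threshold(1)   # ~1.96: the univariate 95% rule
t3 = md_threshold(3)   # larger: in higher dimensions the same quantile
                       # sits farther from the mean
```

Note that the threshold grows with $d$: rejecting at a fixed $MD > 2$ in high dimensions would discard far more than 5% of perfectly Gaussian data.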