Solved – Mahalanobis distance and percentage of the distribution represented

multivariate analysis, standard deviation

In a one-dimensional normal distribution, it is really handy to know that about 68% of the data lie within one standard deviation of the mean, about 95% within two standard deviations, etc. My question is about the higher-dimensional version of this.

Can anyone give me rules of thumb similar to the ones we use so often with standard deviations, i.e., in dimension N, what percentage of a normal distribution lies within Mahalanobis distance 1, 2, 3, …? I kind of assume someone has come up with handy rules for this, even if they are only good approximations.

The point is to know at what Mahalanobis distance I will have excluded x% of the data.

Notes

I worked out the distribution of the Mahalanobis distance for a normal distribution in N dimensional space to be

$$ c \cdot r^{N-1} \cdot \exp\Big(\frac{-r^2}{2}\Big) $$

where $r$ is the Mahalanobis distance and $c$ is a normalizing constant. It is easy to check that the most common distance for a point will be $\sqrt{N-1}$. This means $N > 1$ is quite different from the one-dimensional case, because the most frequent distance is not distance zero.
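As a quick check of that mode, one can differentiate the density with respect to $r$ (a short derivation that uses only the formula already given):

$$ \frac{d}{dr}\Big[ r^{N-1} \cdot \exp\Big(\frac{-r^2}{2}\Big) \Big] = r^{N-2} \cdot \exp\Big(\frac{-r^2}{2}\Big) \cdot \big( (N-1) - r^2 \big) $$

For $N > 1$ this vanishes at positive $r$ only when $r = \sqrt{N-1}$. For reference, the density above is that of the chi distribution with $N$ degrees of freedom, for which the constant is $c = 2^{1-N/2} / \Gamma(N/2)$.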

(Yes, I know one can integrate this by hand using successive integration by parts until one gets an answer in terms of $\operatorname{erf}$, but (1) that sounds painful for unspecified $N$, and (2) the result must already be known.)

Best Answer

I found this to be a very interesting question because it is very natural to ask but I have never seen the answer or thought about it before. Of course the answer should depend on the dimension of the normal. In researching this on the net I found that the squared Mahalanobis distance for a d-dimensional multivariate normal is chi-square distributed with d degrees of freedom. This assumes the mean and covariance matrix are known. So from the chi-square distribution it is easy to find, in units of squared Mahalanobis distance, the 90th, 95th and 99th percentiles, and thus the ellipsoid that has that coverage; taking the square root gives the Mahalanobis distance itself.
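A minimal sketch of that calculation in Python, using only the standard library. The function names (`chi2_cdf`, `mahalanobis_coverage`, `mahalanobis_radius`) are illustrative and not taken from any library, and the code assumes the mean and covariance matrix are known exactly. The chi-square CDF is evaluated with a textbook power series for the regularized incomplete gamma function, which is adequate for moderate dimensions and distances; a statistics library would be the more robust choice in practice.

```python
import math


def chi2_cdf(x, d):
    """P(X <= x) for X chi-square distributed with d degrees of freedom."""
    if x <= 0:
        return 0.0
    a, z = d / 2.0, x / 2.0
    # Power series for the regularized lower incomplete gamma function:
    # P(a, z) = z^a e^{-z} / Gamma(a + 1) * sum_{n>=0} z^n / ((a+1)...(a+n))
    term = 1.0
    total = 1.0
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))


def mahalanobis_coverage(r, d):
    """Fraction of a d-dimensional normal within Mahalanobis distance r."""
    return chi2_cdf(r * r, d)


def mahalanobis_radius(p, d):
    """Mahalanobis distance whose ellipsoid contains a fraction p (0 < p < 1)."""
    lo, hi = 0.0, 1.0
    while mahalanobis_coverage(hi, d) < p:  # grow the bracket until it contains p
        hi *= 2.0
    for _ in range(60):  # plain bisection; coverage is increasing in r
        mid = 0.5 * (lo + hi)
        if mahalanobis_coverage(mid, d) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for d in (1, 2, 3, 5, 10):
        row = ", ".join(f"r={r}: {mahalanobis_coverage(r, d):.4f}" for r in (1, 2, 3))
        print(f"d={d:2d}  {row}  95% radius: {mahalanobis_radius(0.95, d):.3f}")
```

This gives the kind of rule of thumb asked for: in two dimensions, for instance, Mahalanobis distances 1, 2 and 3 cover roughly 39%, 86% and 99% of the distribution, compared with 68%, 95% and 99.7% in one dimension.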

So what I just explained elaborates on Bill Huber's correct but terse response. Although this is just taken from a chi-square table I thought it would be interesting to look at the table below from the Mahalanobis distance perspective.

TABLE OF MAHALANOBIS DISTANCE COVERING 95% OF A MULTIVARIATE NORMAL DISTRIBUTION IN D-DIMENSIONS

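| DIMENSION | MD | CHI-SQUARE (MD^2) |
|---|---|---|
| 1 | 1.960 | 3.841 |
| 2 | 2.448 | 5.991 |
| 3 | 2.796 | 7.815 |
| 4 | 3.080 | 9.488 |
| 5 | 3.327 | 11.070 |
| 10 | 4.279 | 18.307 |
| 15 | 5.000 | 24.996 |
| 20 | 5.604 | 31.410 |
| 25 | 6.136 | 37.652 |
| 30 | 6.616 | 43.773 |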
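Table entries like these can be sanity-checked by simulation. The sketch below is only an illustration: `simulated_coverage` is a made-up helper name, and it assumes a standard normal (zero mean, identity covariance), for which the Mahalanobis distance is simply the Euclidean norm. That loses no generality, since the Mahalanobis distance is unchanged by the whitening transform when the mean and covariance are known.

```python
import random


def simulated_coverage(radius, d, n_samples=20000, seed=0):
    """Monte Carlo estimate of the fraction of a d-dimensional standard
    normal that lies within the given Mahalanobis distance of the mean."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        # Squared Mahalanobis distance = sum of squared independent N(0, 1) draws
        sq_dist = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        if sq_dist <= radius * radius:
            inside += 1
    return inside / n_samples


if __name__ == "__main__":
    # Radii taken from the d = 2 and d = 10 rows of the table
    for d, radius in ((2, 2.448), (10, 4.279)):
        print(d, radius, simulated_coverage(radius, d))
```

With 20,000 samples the estimates should land close to 0.95, with a standard error of about 0.0015.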