Mahalanobis Distance – Understanding the Distribution of Observation-Level Mahalanobis Distance

Tags: multivariate-analysis, outliers

If I have a multivariate normal i.i.d. sample $X_1, \ldots, X_n \sim N_p(\mu,\Sigma)$, and define $$d_i^2(b,A) = (X_i - b)' A^{-1} (X_i - b)$$ (which is a sort of Mahalanobis distance [squared] from a sample point to the vector $b$, using the matrix $A$ for weighting), what is the distribution of $d_i^2(\bar X,S)$ (the Mahalanobis distance to the sample mean $\bar X$, using the sample covariance matrix $S$)?
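As a concrete illustration of the statistic in question (a minimal NumPy sketch, not part of the original question), $d_i^2(\bar X, S)$ can be computed for every observation at once:

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X to the sample mean,
    weighted by the sample covariance S (denominator n - 1)."""
    X = np.asarray(X, dtype=float)
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)              # p x p sample covariance
    diff = X - xbar                          # n x p matrix of deviations
    # Solve S a = diff' rather than inverting S explicitly (better conditioned)
    return np.einsum('ij,ji->i', diff, np.linalg.solve(S, diff.T))
```

A handy sanity check: since $\sum_i (X_i - \bar X)(X_i - \bar X)' = (n-1)S$, the distances must satisfy $\sum_i d_i^2(\bar X, S) = (n-1)p$ exactly, whatever the data.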

I am looking at a paper that claims it is $\chi^2_p$, but this is obviously wrong: the $\chi^2_p$ distribution would be obtained for $d_i^2(\mu,\Sigma)$, using the (unknown) population mean vector and covariance matrix. When the sample analogues are plugged in, one ought to get a Hotelling $T^2$ distribution, a scaled $F$ distribution, or something like that, but not the $\chi^2_p$. I could not find the exact result in Muirhead (2005), Anderson (2003), or Mardia, Kent and Bibby (1979, 2003). Apparently these authors did not bother with outlier diagnostics, as the multivariate normal distribution is perfect and is easily obtained every time one collects multivariate data :-/.

Things may be more complicated than that. The Hotelling $T^2$ result relies on independence between the vector part and the matrix part; such independence holds for $\bar X$ and $S$, but it no longer holds for $X_i$ and $S$.

Best Answer

Check out Gaussian Mixture Modeling by Exploiting the Mahalanobis Distance (alternative link). See page 13, second column. The authors also give a proof deriving the distribution, which is a scaled beta. Please let me know if this does not work for you; otherwise I can check for a hint in the S. S. Wilks book tomorrow.
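If I recall the classical scaled-beta result correctly (the specific constants below are my recollection, not taken from the linked paper), it reads $$\frac{n\, d_i^2(\bar X, S)}{(n-1)^2} \sim \mathrm{Beta}\!\left(\frac{p}{2},\, \frac{n-p-1}{2}\right),$$ which indeed converges to $\chi^2_p/ (n-1) \cdot (n-1) \approx \chi^2_p$ behaviour only as $n \to \infty$. A quick Monte Carlo sketch to check this form, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, reps = 50, 3, 2000
vals = np.empty(reps)
for r in range(reps):
    X = rng.standard_normal((n, p))          # i.i.d. N_p(0, I) sample
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)              # sample covariance, denominator n - 1
    diff = X[0] - xbar                       # one fixed observation per replicate
    d2 = diff @ np.linalg.solve(S, diff)     # d_i^2(xbar, S)
    vals[r] = n * d2 / (n - 1) ** 2          # candidate Beta(p/2, (n-p-1)/2) variate

# Compare against the claimed beta law; its mean is p / (n - 1)
ks = stats.kstest(vals, stats.beta(p / 2, (n - p - 1) / 2).cdf)
```

Because $n\,d_i^2/(n-1)^2$ is bounded by 1 for any data set, the simulated values always land in $[0,1]$, and with the beta parameters above the KS p-value should typically be unremarkable (i.e., no rejection), while fitting $\chi^2_p$ to the raw $d_i^2$ at small $n$ would fail.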
