I need to calculate the sample Mahalanobis distance in R between every pair of observations in a $n \times p$ matrix of covariates. I need a solution that is efficient, i.e. only $n(n-1)/2$ distances are calculated, and preferably implemented in C/RCpp/Fortran etc. I assume that $\Sigma$, the population covariance matrix, is unknown and use the sample covariance matrix in its place.
I am particularly interested in this question since there seems to be no "consensus" method for calculating pairwise Mahalanobis distances in R, i.e. it is not implemented in the dist
function nor in the cluster::daisy
function. The mahalanobis
function does not calculate pairwise distances without additional work from the programmer.
This was already asked here Pairwise Mahalanobis distance in R, but the solutions there seem incorrect.
Here is a correct but terribly inefficient (since $n \times n$ distances are calculated) method:
set.seed(0)
x0 <- MASS::mvrnorm(33,1:10,diag(c(seq(1,1/2,l=10)),10))
dM = as.dist(apply(x0, 1, function(i) mahalanobis(x0, i, cov = cov(x0))))
This is easy enough to code myself in C, but I feel like something this basic should have a preexisting solution. Is there one?
There are other solutions that fall short: HDMD::pairwise.mahalanobis()
calculates $n \times n$ distances, when only $n(n-1)/2$ unique distances are required. compositions::MahalanobisDist()
seems promising, but I don't want my function to come from a package that depends on rgl
, which severely limits others' ability to run my code. Unless this implementation is perfect, I'd rather write my own. Anybody have experience with this function?
Best Answer
Starting from ahfoss's "succint" solution, I have used the Cholesky decomposition in place of the SVD.
It should be faster, because forward-solving a triangular system is faster then dense matrix multiplication with the inverse covariance (see here). Here are the benchmarks with ahfoss's and whuber's solutions in several settings:
So Cholesky seems to be uniformly faster.