Pairwise Mahalanobis Distances: Calculation and Applications

algorithms, distance, r

I need to calculate the sample Mahalanobis distance in R between every pair of observations in an $n \times p$ matrix of covariates. I need a solution that is efficient, i.e. one that computes only the $n(n-1)/2$ unique distances, and preferably implemented in C/Rcpp/Fortran etc. I assume that $\Sigma$, the population covariance matrix, is unknown and use the sample covariance matrix in its place.
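
To be explicit, for a pair of observations $x_i$ and $x_j$ (rows of the matrix), the quantity I want is

$$d(x_i, x_j) = \sqrt{(x_i - x_j)^\top S^{-1} (x_i - x_j)},$$

where $S$ is the sample covariance matrix.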

I am particularly interested in this question since there seems to be no "consensus" method for calculating pairwise Mahalanobis distances in R: it is implemented neither in the dist function nor in the cluster::daisy function, and the mahalanobis function does not calculate pairwise distances without additional work from the programmer.

This was already asked here: Pairwise Mahalanobis distance in R, but the solutions there seem incorrect.

Here is a correct but terribly inefficient method (since all $n \times n$ distances are calculated):

set.seed(0)
x0 <- MASS::mvrnorm(33, 1:10, diag(seq(1, 1/2, l = 10), 10))
# column j holds the squared Mahalanobis distances from every row of x0 to row j
dM <- as.dist(apply(x0, 1, function(i) mahalanobis(x0, i, cov = cov(x0))))

This is easy enough to code myself in C, but I feel like something this basic should have a preexisting solution. Is there one?

There are other solutions that fall short: HDMD::pairwise.mahalanobis() calculates $n \times n$ distances, when only $n(n-1)/2$ unique distances are required. compositions::MahalanobisDist() seems promising, but I don't want my function to come from a package that depends on rgl, which severely limits others' ability to run my code. Unless this implementation is perfect, I'd rather write my own. Anybody have experience with this function?

Best Answer

Starting from ahfoss's "succinct" solution, I have used the Cholesky decomposition in place of the SVD.

cholMaha <- function(X) {
  # upper-triangular Cholesky factor R of the sample covariance: cov(X) = R'R
  dec <- chol(cov(X))
  # whiten the observations: columns of tmp are R'^{-1} x_i (t(dec) is lower triangular)
  tmp <- forwardsolve(t(dec), t(X))
  # Euclidean distances between whitened observations = pairwise Mahalanobis distances
  dist(t(tmp))
}
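
To see why this works, write the sample covariance as $S = L L^\top$, where $L = $ t(dec) is the lower-triangular Cholesky factor. For any pair of observations,

$$d^2(x_i, x_j) = (x_i - x_j)^\top S^{-1} (x_i - x_j) = \| L^{-1}(x_i - x_j) \|^2,$$

so the ordinary Euclidean distances between the whitened observations $z_i = L^{-1} x_i$ (the columns of tmp) are exactly the pairwise Mahalanobis distances, which is what dist() then computes.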

It should be faster, because forward-solving a triangular system is faster than a dense matrix multiplication with the inverse covariance matrix (see here). Below are benchmarks against ahfoss's and whuber's solutions (fastPwMahal and mahal) in several settings.
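
Since fastPwMahal (from ahfoss's answer) and mahal (from whuber's) are not reproduced here, a rough sketch of an SVD-based pairwise function in the spirit of ahfoss's approach (not necessarily his exact code) could look like this:

fastPwMahal <- function(x1, invCovMat) {
  # symmetric square root of the inverse covariance matrix via its SVD
  SQRT <- with(svd(invCovMat), u %*% diag(sqrt(d)) %*% t(v))
  # Euclidean distances on the transformed data equal the Mahalanobis distances
  dist(x1 %*% SQRT)
}

The benchmarks: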

 require(microbenchmark)
 set.seed(26565)
 N <- 100
 d <- 10

 X <- matrix(rnorm(N*d), N, d)

 A <- cholMaha( X = X ) 
 A1 <- fastPwMahal(x1 = X, invCovMat = solve(cov(X))) 
 sum(abs(A - A1)) 
 # [1] 5.973666e-12  Reassuring!

   microbenchmark(cholMaha(X),
                  fastPwMahal(x1 = X, invCovMat = solve(cov(X))),
                  mahal(x = X))
Unit: microseconds
expr          min       lq   median       uq      max neval
cholMaha    502.368 508.3750 512.3210 516.8960  542.806   100
fastPwMahal 634.439 640.7235 645.8575 651.3745 1469.112   100
mahal       839.772 850.4580 857.4405 871.0260 1856.032   100

 N <- 10
 d <- 5
 X <- matrix(rnorm(N*d), N, d)

   microbenchmark(cholMaha(X),
                  fastPwMahal(x1 = X, invCovMat = solve(cov(X))),
                  mahal(x = X)
                    )
Unit: microseconds
expr          min       lq    median       uq      max neval
cholMaha    112.235 116.9845 119.114 122.3970  169.924   100
fastPwMahal 195.415 201.5620 205.124 208.3365 1273.486   100
mahal       163.149 169.3650 172.927 175.9650  311.422   100

 N <- 500
 d <- 15
 X <- matrix(rnorm(N*d), N, d)

   microbenchmark(cholMaha(X),
                  fastPwMahal(x1 = X, invCovMat = solve(cov(X))),
                  mahal(x = X)
                    )
Unit: milliseconds
expr          min       lq     median       uq      max neval
cholMaha    14.58551 14.62484 14.74804 14.92414 41.70873   100
fastPwMahal 14.79692 14.91129 14.96545 15.19139 15.84825   100
mahal       12.65825 14.11171 39.43599 40.26598 41.77186   100

 N <- 500
 d <- 5
 X <- matrix(rnorm(N*d), N, d)

   microbenchmark(cholMaha(X),
                  fastPwMahal(x1 = X, invCovMat = solve(cov(X))),
                  mahal(x = X)
                    )
Unit: milliseconds
expr           min        lq      median        uq       max neval
cholMaha     5.007198  5.030110  5.115941  5.257862  6.031427   100
fastPwMahal  5.082696  5.143914  5.245919  5.457050  6.232565   100
mahal        10.312487 12.215657 37.094138 37.986501 40.153222   100

So the Cholesky approach seems to be the fastest (by median time) in every setting tested.
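
As a final sanity check, the result can be compared against the brute-force approach from the question (note that mahalanobis() returns squared distances, so the square root is taken before comparing):

set.seed(0)
x0  <- MASS::mvrnorm(33, 1:10, diag(seq(1, 1/2, l = 10), 10))
# squared pairwise distances from the question's brute-force method
dM2 <- as.dist(apply(x0, 1, function(i) mahalanobis(x0, i, cov = cov(x0))))
all.equal(sqrt(as.numeric(dM2)), as.numeric(cholMaha(x0)))
# should be TRUE, up to numerical tolerance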