Solved – Multivariate Wasserstein metric for $n$-dimensions

distancemetricmultivariate analysiswasserstein

I am a vegetation ecologist and poor student of computer science who recently learned of the Wasserstein metric. Application of this metric to 1d distributions I find fairly intuitive, and inspection of the wasserstein1d function from transport package in R helped me to understand its computation, with the following line most critical to my understanding:

mean(abs(sort(b) - sort(a))^p)^(1/p)

In the case where the two vectors a and b are of unequal length, it appears that this function interpolates, inserting values within each vector, which are duplicates of the source data until the lengths are equal.

My question has to do with extending the Wasserstein metric to n-dimensional distributions. With the following 7d example dataset generated in R:

d <- 7
obs <- 100

d7a <- matrix(nrow = obs, ncol = d, data = 0)
d7b <- matrix(nrow = obs, ncol = d, data = 0)

set.seed(123)

for(i in 1:7){
  d7a[,i] <- rnorm(obs)
  d7b[,i] <- rnorm(obs)
}

wassersteindNd(d7a, d7b) #fictitious function here

Is it possible to compute this distance, and are there packages available in R or python that do this?

Best Answer

Wasserstein in 1D is a special case of optimal transport. Both the R wasserstein1d and Python scipy.stats.wasserstein_distance are intended solely for the 1D special case. The algorithm behind both functions rank discrete data according to their c.d.f.'s so that the distances and amounts to move are multiplied together for corresponding points between $u$ and $v$ nearest to one another. More on the 1D special case can be found in Remark 2.28 of Peyre and Cuturi's Computational optimal transport.

The 1D special case is much easier than implementing linear programming, which is the approach that must be followed for higher-dimensional couplings. Linear programming for optimal transport is hardly anymore harder computation-wise than the ranking algorithm of 1D Wasserstein however, being fairly efficient and low-overhead itself. wasserstein1d and scipy.stats.wasserstein_distance do not conduct linear programming.

What you're asking about might not really have anything to do with higher dimensions though, because you first said "two vectors a and b are of unequal length". If the source and target distributions are of unequal length, this is not really a problem of higher dimensions (since after all, there are just "two vectors a and b"), but a problem of unbalanced distributions (i.e. "unequal length"), which is in itself another special case of optimal transport that might admit difficulties in the Wasserstein optimization. Some work-arounds for dealing with unbalanced optimal transport have already been developed of course.

If it really is higher-dimensional, multivariate transportation that you're after (not necessarily unbalanced OT), you shouldn't pursue your attempted code any further since you apparently are just trying to extend the 1D special case of Wasserstein when in fact you can't extend that 1D special case to a multivariate setting. Look into linear programming instead. The pot package in Python, for starters, is well-known, whose documentation addresses the 1D special case, 2D, unbalanced OT, discrete-to-continuous and more.

Related Question