To start, let's clear up two points of confusion:
(1) "[W]e use the t-distribution if the sample size is small."
Not exactly. If the variances $\sigma_1^2,\, \sigma_2^2$ are unknown
and estimated by $S_1^2,\, S_2^2,$ respectively, then you always
use the t-distribution. (If sample sizes are large enough for
degrees of freedom to exceed 30, then in some circumstances
it is OK to use a normal approximation. But with modern software
or printed t tables,
the normal approximation is not necessary. The approximation works
best for tests at the 5% level, not so well at 1%.)
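A quick check of critical values in R illustrates the point: with 30 degrees of freedom the 5%-level cutoffs are close to the normal ones, the 1%-level cutoffs less so.
qt(c(0.975, 0.995), df = 30)   # about 2.04 and 2.75
qnorm(c(0.975, 0.995))         # about 1.96 and 2.58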
(2) "[A]ssuming that the true standard deviations are not equal, ... then the degrees of freedom is given [by the Welch–Satterthwaite equation]."
No. This equation works whether or not $\sigma_1 = \sigma_2.$ However, if the variances are not equal, you must use the Welch–Satterthwaite equation (not the pooled-variance formula with degrees of freedom $\nu = n_1 + n_2 - 2$).
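For reference, the Welch–Satterthwaite equation estimates the degrees of freedom as
$$\nu = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^{2}}{\frac{(S_1^2/n_1)^2}{n_1 - 1} + \frac{(S_2^2/n_2)^2}{n_2 - 1}}.$$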
Pooled 2-sample t test: If data are normal and population variances are equal, then the test statistic
for testing $H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 \ne \mu_2$ is:
$$T = \frac{\bar X_1 - \bar X_2}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}},$$
where $S_p^2 =\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1 + n_2 - 2}.$
If $H_0$ is true, then $T$ has Student's t distribution with
degrees of freedom $\nu = n_1 + n_2 - 2.$
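A minimal R sketch of this computation from summary statistics (the sample sizes, means, and SDs below are made up for illustration):
n1 = 12; xbar1 = 103.5; s1 = 8.2
n2 = 15; xbar2 = 98.1; s2 = 7.9
sp2 = ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1 + n2 - 2)   # pooled variance
t.pool = (xbar1 - xbar2)/(sqrt(sp2)*sqrt(1/n1 + 1/n2))
2*pt(-abs(t.pool), df = n1 + n2 - 2)              # two-sided p-value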
Welch 'separate variances' 2-sample t test: More generally, if $H_0$ is true, then the test statistic
$$T^\prime = \frac{\bar X_1 - \bar X_2}{\sqrt{\frac{S_1^2}{n_1} +\frac{S_2^2}{n_2}}}$$
is approximately distributed according to Student's t distribution with degrees of freedom $\nu$ given by the Welch–Satterthwaite equation. This is true whether or not the population variances are equal.
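Continuing the made-up numbers above, a sketch of $T^\prime$ and the Welch–Satterthwaite degrees of freedom:
se2 = s1^2/n1 + s2^2/n2
t.welch = (xbar1 - xbar2)/sqrt(se2)
nu.w = se2^2/((s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1))
2*pt(-abs(t.welch), df = nu.w)   # two-sided p-value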
One can show that the degrees of freedom $\nu$ from the Welch–Satterthwaite equation satisfies
$$\min(n_1 - 1, n_2 - 1) \le \nu \le n_1 + n_2 - 2.$$ So if the smaller of the two sample sizes exceeds 30, then $\nu \ge 30$ and (testing at
the 5% level) it is OK to use a normal approximation for the
distribution of $T^\prime.$
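For the sketch above, the bounds can be checked numerically (an illustration, not a proof):
min(n1, n2) - 1 <= nu.w   # TRUE
nu.w <= n1 + n2 - 2       # TRUE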
Whatever the sample sizes, $T^\prime$ has very nearly Student's t distribution with the Welch–Satterthwaite degrees of freedom.
(This is known from probability theory and from many simulation studies.)
Which to use? The bottom line is that most statisticians use the $T^\prime$-statistic and the Welch-Satterthwaite degrees of freedom to do
2-sample t tests unless they have very strong prior evidence that
population variances are equal (rarely the case). Most modern
statistical software packages use the Welch 2-sample t test by default. Some programs will use $T$ with the pooled SD $S_p$ if
the user overrides the default.
Notes: (a) If $n_1 = n_2,$ then one can show that $T = T^\prime$
numerically, but one should still use the Welch-Satterthwaite degrees of freedom unless the population variances are known to be equal.
(b) If sample variances $S_1^2$ and $S_2^2$ are nearly equal,
then the Welch-Satterthwaite $\nu$ is near $n_1 + n_2 - 2.$ If the
sample variances are far apart then $\nu$ may be considerably smaller---perhaps as small as $n_1 -1$ or $n_2 - 1.$
(c) Especially if $n_1 \ll n_2$ and $\sigma_2 \ll \sigma_1,$ then results from the pooled 2-sample test using $T$ and $S_p$ can be very misleading. (The notation $\ll$ means 'much smaller than'.)
(d) It is not a good idea to test whether $\sigma_1^2 = \sigma_2^2$ in order to decide whether to use $T$ or $T^\prime.$ The test for equal variances has poor power, and simulation studies have shown
that the 'hybrid' test (using $T^\prime$ only if the equal-variances test rejects) can give misleading results.
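A rough R simulation sketch of note (d); the sample sizes and SDs are made up, and the estimated rate will vary from run to run:
set.seed(42)
B = 10000; reject = 0
for (b in 1:B) {
  y1 = rnorm(10, 100, 10)   # small sample, larger SD
  y2 = rnorm(40, 100, 5)    # large sample, smaller SD; H0 is true
  eq = var.test(y1, y2)$p.value > 0.05   # hybrid: pool unless F test rejects
  if (t.test(y1, y2, var.equal = eq)$p.value < 0.05) reject = reject + 1
}
reject/B   # estimated type I error; in settings like this it tends to exceed 0.05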
Demonstration of note (c). Using R statistical software:
Small sample from $\mathsf{Norm}(\mu_1=150,\sigma_1=30);$
larger sample from $\mathsf{Norm}(\mu_2=150,\sigma_2=5).$
The null hypothesis is true, and so should not be rejected.
x1 = rnorm(10, 150, 30); x2 = rnorm(50, 150, 5)
mean(x1); sd(x1)
[1] 139.3158
[1] 31.34551
mean(x2); sd(x2)
[1] 150.1088
[1] 5.246149
Welch 2-sample test properly fails to reject:
t.test(x1, x2)
Welch Two Sample t-test
data: x1 and x2
t = -1.0858, df = 9.1011, p-value = 0.3055
alternative hypothesis: true difference in means is not equal to 0
sample estimates:
mean of x mean of y
139.3158 150.1088
Pooled two-sample t test improperly rejects at the 5% level, 'finding' a
difference in population means that does not actually exist.
(The small sample with the large SD gives a misleading sample mean.)
t.test(x1, x2, var.equal=TRUE)
Two Sample t-test
data: x1 and x2
t = -2.3504, df = 58, p-value = 0.02217
alternative hypothesis: true difference in means is not equal to 0
sample estimates:
mean of x mean of y
139.3158 150.1088
I solved the problem using the MOSEK solver and got similar results.
import numpy as np
import cvxpy as cp

K = 600   # number of samples per distribution
D = 25    # dimension
# K samples each from N(0, I_D) and N(0, 4 I_D); columns are samples
train = np.random.multivariate_normal(np.zeros(D), np.eye(D), size=K).T
test = np.random.multivariate_normal(np.zeros(D), 4*np.eye(D), size=K).T

# Cost matrix of pairwise Euclidean distances
M = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        M[i, j] = np.linalg.norm(train[:, i] - test[:, j], 2)

# Transport plan: nonnegative coupling with uniform marginals 1/K
Tra = cp.Variable((K, K), nonneg=True)
constraints = [Tra @ np.ones((K, 1)) == np.ones((K, 1))/K,
               Tra.T @ np.ones((K, 1)) == np.ones((K, 1))/K]
optprob = cp.Problem(cp.Minimize(cp.trace(Tra.T @ M)), constraints)
optprob.solve(solver=cp.MOSEK)
print(np.trace(Tra.value.T @ M))
The result is around 8.3 in dimension 25, which is not surprising: for continuous random variables, the Wasserstein distance is a continuous optimization problem over all possible joint probability distributions, while for the empirical distributions we optimize over the finite coupling matrix $T$, which typically yields a higher optimal objective value than the continuous case.
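For comparison (an aside, not part of the original computation): because the cost matrix $M$ holds unsquared Euclidean norms, the objective above is an empirical 1-Wasserstein distance. The 2-Wasserstein distance between the two underlying Gaussians has the well-known closed form
$$W_2^2\big(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)\big) = \|m_1 - m_2\|^2 + \mathrm{tr}\left(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\big)^{1/2}\right),$$
which for $\mathcal{N}(0, I_{25})$ and $\mathcal{N}(0, 4I_{25})$ gives $W_2^2 = \mathrm{tr}(I + 4I - 4I) = 25,$ i.e. $W_2 = 5.$ Since $W_1 \le W_2,$ the empirical value near 8.3 overshoots the population distance, consistent with the well-documented tendency of empirical Wasserstein estimates to overshoot in high dimension.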
Best Answer
Although a bit old, this is indeed a good question. Here is my bit on the matter:
Regarding Gaussian Mixture Models: A Wasserstein-type distance in the space of Gaussian Mixture Models, Julie Delon and Agnes Desolneux, https://arxiv.org/pdf/1907.05254.pdf
Using the 2-Wasserstein metric, Mallasto and Feragen geometrize the space of Gaussian processes with $L_2$ mean and covariance functions over compact index spaces: Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes, Anton Mallasto, Aasa Feragen https://papers.nips.cc/paper/7149-learning-from-uncertain-curves-the-2-wasserstein-metric-for-gaussian-processes.pdf
The Wasserstein space of elliptical distributions is characterized by Muzellec and Cuturi. The authors show that for elliptical probability distributions, the Wasserstein distance can be computed via a simple Riemannian descent procedure: Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions, Boris Muzellec and Marco Cuturi https://arxiv.org/pdf/1805.07594.pdf (Not closed form)
Tree metrics as ground metrics yield negative definite OT metrics that can be computed in closed form. The Sliced-Wasserstein distance is then a special case (the tree is a chain): Tree-Sliced Variants of Wasserstein Distances, Tam Le, Makoto Yamada, Kenji Fukumizu, Marco Cuturi https://arxiv.org/pdf/1902.00342.pdf
Sinkhorn distances/divergences (Cuturi, 2013) are now treated as new forms of distances in their own right (i.e. not as approximations to $\mathcal{W}_2^2$) (Genevay et al, 2019). Recently, this entropy-regularized optimal transport distance was found to admit a closed form for Gaussian measures: Janati et al (2020). This fascinating finding also extends to the unbalanced case.
I would be happy to keep this list up to date and evolving.