2-Wasserstein distance between empirical distributions

network-flow, optimal-transport, probability-theory

I am trying to validate the earth-mover distance implementation from the Python Optimal Transport (POT) library https://pot.readthedocs.io/en/stable/all.html#module-ot. It would be great if somebody could point out any flaws in what I write below.

Given two measures $\mu_1, \mu_2$ on $\mathbb{R}^N$, the 2-Wasserstein distance between them is $$W_2^2(\mu_1,\mu_2) = \inf \mathbb{E} \left[ \|x-y\|_2^2 \right]$$ where the infimum is taken over all random vectors $(x,y)\in \mathbb{R}^N \times \mathbb{R}^N$ with $x\sim \mu_1$ and $y\sim \mu_2$.

Specializing to Gaussian measures $\mu_1\sim N(m_1,\Sigma_1)$ and $\mu_2\sim N(m_2,\Sigma_2)$, one can show that $$W_2^2(\mu_1,\mu_2) = \|m_1-m_2\|_2^2 + \text{trace} \left( \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \right)^{1/2} \right)$$ (see Remark 2.30 in Peyré-Cuturi's text).

Now let's focus on the zero-centered case $m_1=m_2=0$ with diagonal covariance matrices $\Sigma_1 = I$ and $\Sigma_2 = 4I$. The formula above reduces to $W_2^2 = \text{trace}\left( I + 4I - 2(4I)^{1/2} \right) = N$, so $W_2 = \sqrt{N}$: in dimensions 10, 25, 50 and 100 this yields 3.16, 5.0, 7.1 and 10.0 respectively, and I seek to recover these values from the corresponding empirical distributions.
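For reference, a quick numerical check of the closed-form expression (my own sketch, not part of the question), using scipy.linalg.sqrtm for the matrix square roots; it should print $\sqrt{N}$ for the diagonal case above:

import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(m1, S1, m2, S2):
    # closed-form (Bures-Wasserstein) 2-Wasserstein distance between Gaussians
    S1h = sqrtm(S1)
    cross = sqrtm(S1h @ S2 @ S1h)
    w2sq = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross)
    return np.sqrt(np.real(w2sq))  # sqrtm can introduce tiny imaginary parts

for N in (10, 25, 50, 100):
    print(N, gaussian_w2(np.zeros(N), np.eye(N), np.zeros(N), 4 * np.eye(N)))
    # prints 3.16..., 5.0, 7.07..., 10.0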

For measures with discrete support $\mu_1=\sum_{i=1}^n a_i \delta_{x_i}$ and $\mu_2=\sum_{j=1}^n b_j \delta_{y_j}$ (with weight vectors $a,b$ belonging to the n-dimensional simplex), evaluation of the Wasserstein distance corresponds to solving a network flow problem $$W_2^2 = \min_{T\in \Pi(a,b)}\text{trace}(T^TM)$$ where $M$ is the cost matrix with entries $M_{ij}=\|x_i-y_j\|_2^2$ and where $\Pi(a,b)$ denotes the transportation polytope $$\Pi(a,b) = \left\{ T \in \mathbb{R}_{\geq 0}^{n\times n}: \quad T 1_n=a, \quad T^T 1_n=b \right\}.$$
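To make the LP structure explicit, here is a tiny sketch of mine that solves this network flow problem directly with scipy.optimize.linprog on the vectorized plan $T$ (the sample size and seed are arbitrary choices); for realistic sizes a dedicated solver such as ot.emd is far more efficient:

import numpy as np
from scipy.optimize import linprog

n = 3
rng = np.random.default_rng(0)
x = rng.normal(size=(n, 2))
y = rng.normal(size=(n, 2))
a = b = np.full(n, 1.0 / n)

# cost matrix M_ij = ||x_i - y_j||_2^2, flattened row-major to match T.ravel()
M = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

# row-sum constraints T 1_n = a and column-sum constraints T^T 1_n = b
A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # i-th row sum
    A_eq[n + i, i::n] = 1.0            # i-th column sum

res = linprog(M.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]), bounds=(0, None))
print(res.fun)  # optimal value, i.e. W_2^2 of the two empirical measures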

As far as I can tell, the solution of this network flow problem is implemented in the function ot.emd. If I draw samples from the two Gaussian distributions above and apply ot.emd (with $a_i=b_i=\frac{1}{n}$), the results I get for different dimensions $N$ and sample sizes $n$ are displayed in the figure below, and they are very different from the expected theoretical values $W_{2,N=10}=3.16$, $W_{2,N=25}=5.0$, $W_{2,N=50}=7.1$ and $W_{2,N=100}=10.0$. The graphs represent the average over 20 independent runs, and the variance is negligible. In high dimensions, the estimates seem to be off by a factor of 2, plus some downward bias.

[Figure: empirical estimates of $W_2$ vs. sample size $n$ for $N = 10, 25, 50, 100$, averaged over 20 runs]
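For concreteness, a minimal version of the experiment as I understand it (the dimension, sample size and seed below are my own choices). One detail worth stressing: ot.dist defaults to squared Euclidean costs, so the optimal value $\langle T, M\rangle$ estimates $W_2^2$ and the square root has to be taken afterwards.

import numpy as np
import ot  # Python Optimal Transport

N, n = 25, 500                           # dimension, sample size
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(n, N))    # samples from N(0, I)
y = rng.normal(0.0, 2.0, size=(n, N))    # samples from N(0, 4I)

a = b = np.full(n, 1.0 / n)              # uniform weights
M = ot.dist(x, y, metric='sqeuclidean')  # cost matrix M_ij = ||x_i - y_j||_2^2
G = ot.emd(a, b, M)                      # optimal transport plan
w2_sq = np.sum(G * M)                    # optimal value trace(G^T M)
print(np.sqrt(w2_sq))                    # compare with the theoretical W_2 = 5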

There is also a considerable upward bias associated with this estimator and very slow convergence of order $\mathcal{O}(n^{-1/N})$ (cf. Peyré-Cuturi, Section 8.4), which makes me wonder how these estimators are used in practice. Below I plot the self-distance between two standard Gaussians as a function of sample size in different dimensions, again using the average of 20 independent draws. I am familiar with the Sinkhorn-Knopp approximation to the EMD, which solves the network flow problem faster, but not more accurately.
[Figure: empirical self-distance between two standard Gaussians vs. sample size $n$ in different dimensions, averaged over 20 runs]
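For completeness, the Sinkhorn-Knopp approximation mentioned above is available in POT as ot.sinkhorn2. A minimal sketch (the regularization strength reg is an arbitrary choice of mine); the entropic value approaches the unregularized LP optimum as reg shrinks:

import numpy as np
import ot

N, n = 25, 500
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(n, N))
y = rng.normal(0.0, 2.0, size=(n, N))
a = b = np.full(n, 1.0 / n)
M = ot.dist(x, y, metric='sqeuclidean')

reg = 0.05 * M.max()                     # regularization strength (my choice)
w2_sq_sink = ot.sinkhorn2(a, b, M, reg)  # entropic approximation of the LP value
print(np.sqrt(w2_sq_sink))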

Best Answer

I solved the problem using the MOSEK solver and got similar results.

import numpy as np
import cvxpy as cp

K = 600  # sample size
D = 25   # dimension
train = np.random.multivariate_normal(np.zeros(D), np.diag(np.ones(D) * 1), size=K).T
test = np.random.multivariate_normal(np.zeros(D), np.diag(np.ones(D) * 4), size=K).T
M = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        # note: this is the (unsquared) Euclidean norm, not the squared cost
        # M_ij = ||x_i - y_j||_2^2 from the question
        M[i, j] = np.linalg.norm(train[:, i] - test[:, j], 2)
Tra = cp.Variable((K, K), nonneg=True)
constraint = [Tra @ np.ones((K, 1)) == np.ones((K, 1)) / K,
              Tra.T @ np.ones((K, 1)) == np.ones((K, 1)) / K]
optprob = cp.Problem(cp.Minimize(cp.trace(Tra.T @ M)), constraint)
optprob.solve(solver=cp.MOSEK)
print(np.trace(Tra.value.T @ M))

The result is around 8.3 in dimension 25, which is not surprising: for continuous random variables the Wasserstein distance is a continuous optimization problem over all possible joint probability distributions, while for the empirical distributions we in fact optimize over the matrix $T$, which yields a higher optimal objective value than the continuous case.
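As a cross-check (my addition, not part of the original answer), ot.emd2 on the same unsquared-norm cost matrix should reproduce the MOSEK optimum, since both solve the same linear program:

import numpy as np
import ot

K, D = 600, 25
rng = np.random.default_rng(0)                # seed is my choice
train = rng.normal(0.0, 1.0, size=(K, D))     # N(0, I)
test = rng.normal(0.0, 2.0, size=(K, D))      # N(0, 4I)
M = ot.dist(train, test, metric='euclidean')  # unsquared norm, as in the loop above
val = ot.emd2(np.full(K, 1.0 / K), np.full(K, 1.0 / K), M)
print(val)  # should agree with the cvxpy/MOSEK optimum up to solver tolerance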
