I am trying to validate the earth-mover distance implementation from the Python Optimal Transport library https://pot.readthedocs.io/en/stable/all.html#module-ot. It would be great if somebody could point out any flaws in what I write below.
Given two measures $\mu_1, \mu_2$ on $\mathbb{R}^N$, the 2-Wasserstein distance between them is $$W_2^2(\mu_1,\mu_2) = \inf \mathbb{E} \left[ \|x-y\|_2^2 \right]$$ where the infimum is taken over all random vectors $(x,y)\in \mathbb{R}^N \times \mathbb{R}^N$ with $x\sim \mu_1$ and $y\sim \mu_2$.
Specializing to Gaussian measures $\mu_1\sim N(m_1,\Sigma_1)$ and $\mu_2\sim N(m_2,\Sigma_2)$, one can show that $$W_2^2(\mu_1,\mu_2) = \|m_1-m_2\|_2^2 + \text{trace} \left( \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \right)^{1/2} \right)$$ (see Remark 2.30 in Peyre-Cuturi's text).
Now let's focus on the zero-centered case $m_1=m_2=0$ with diagonal covariance matrices $\Sigma_{1,ii}=1$ and $\Sigma_{2,ii}=4$. Each diagonal entry then contributes $1 + 4 - 2\sqrt{1\cdot 4} = 1$ to the trace, so $W_2 = \sqrt{N}$: in dimensions 10, 25, 50 and 100, the above expression yields 3.16, 5.0, 7.1 and 10.0 respectively, and I seek to recover these values from the corresponding empirical distributions.
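For reference, the closed-form expression can be evaluated numerically. The helper `gaussian_w2` below is my own sketch of the Gaussian (Bures-Wasserstein) formula using `scipy.linalg.sqrtm`, not a library function:

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(m1, S1, m2, S2):
    """W2 between N(m1, S1) and N(m2, S2) via the closed-form expression
    ||m1 - m2||^2 + trace(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})."""
    S1h = sqrtm(S1)
    cross = sqrtm(S1h @ S2 @ S1h)
    # sqrtm can return a complex array with negligible imaginary part
    w2sq = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross).real
    return np.sqrt(w2sq)

# The diagonal case above: S1 = I, S2 = 4I, so W2 = sqrt(N)
for N in (10, 25, 50, 100):
    w = gaussian_w2(np.zeros(N), np.eye(N), np.zeros(N), 4 * np.eye(N))
    print(N, round(w, 2))   # 3.16, 5.0, 7.07, 10.0
```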
For measures with discrete support $\mu_1=\sum_{i=1}^n a_i \delta_{x_i}$ and $\mu_2=\sum_{j=1}^n b_j \delta_{y_j}$ (with weight vectors $a,b$ belonging to the n-dimensional simplex), evaluation of the Wasserstein distance corresponds to solving a network flow problem $$W_2^2 = \min_{T\in \Pi(a,b)}\text{trace}(T^TM)$$ where $M$ is the cost matrix with entries $M_{ij}=\|x_i-y_j\|_2^2$ and where $\Pi(a,b)$ denotes the transportation polytope $$\Pi(a,b) = \left\{ T \in \mathbb{R}_+^{n\times n}: \quad T 1_n=a, \quad T^T 1_n=b \right\}.$$
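For a hand-checkable instance, this network flow problem can also be posed directly as a generic linear program. The sketch below uses `scipy.optimize.linprog` purely for illustration (ot.emd uses a specialized network-simplex solver, which is much faster):

```python
import numpy as np
from scipy.optimize import linprog

def emd_lp(a, b, M):
    """Solve min <T, M> over the transportation polytope Pi(a, b)
    as a plain LP (illustrative sketch only)."""
    n, m = M.shape
    # Row-sum constraints: T 1 = a
    A_rows = np.zeros((n, n * m))
    for i in range(n):
        A_rows[i, i * m:(i + 1) * m] = 1.0
    # Column-sum constraints: T^T 1 = b
    A_cols = np.zeros((m, n * m))
    for j in range(m):
        A_cols[j, j::m] = 1.0
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([a, b])
    res = linprog(M.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

# Sanity check: identical two-point supports => zero transport cost
x = np.array([[0.0], [1.0]])
M = (x - x.T) ** 2          # squared-distance cost matrix
a = b = np.array([0.5, 0.5])
print(emd_lp(a, b, M))      # 0.0
```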
As far as I understand, the solution of this network flow problem is implemented in the function ot.emd. If I draw samples from the two Gaussian distributions above and apply ot.emd (with $a_i=b_i=\frac{1}{n}$), the results I get for different dimensions $N$ and sample sizes $n$ are displayed in the figure below and are very different from the expected theoretical values $W_{2,N=10}=3.16$, $W_{2,N=25}=5.0$, $W_{2,N=50}=7.1$ and $W_{2,N=100}=10.0$. The graphs show the average over 20 independent runs, and the variance is negligible. In high dimensions, the estimates seem to be off by a factor of 2, plus some downward bias.
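As a cross-check of the empirical estimator that avoids POT altogether: with uniform weights $a_i=b_j=\frac{1}{n}$ and equal sample sizes, Birkhoff's theorem implies an optimal coupling can be taken to be a permutation matrix, so the same optimum is obtained from an assignment problem via `scipy.optimize.linear_sum_assignment`. A minimal sketch (sample size and seed are arbitrary choices of mine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_w2(xs, ys):
    """Empirical W2 between two equal-size samples with uniform weights.
    For a = b = 1/n the optimal coupling is a permutation (Birkhoff),
    so the network flow problem reduces to an assignment problem."""
    M = ((xs[:, None, :] - ys[None, :, :]) ** 2).sum(-1)  # squared-distance costs
    row, col = linear_sum_assignment(M)
    return np.sqrt(M[row, col].mean())

rng = np.random.default_rng(0)
N, n = 10, 500
xs = rng.standard_normal((n, N))        # samples from N(0, I)
ys = 2.0 * rng.standard_normal((n, N))  # samples from N(0, 4I)
print(empirical_w2(xs, ys))  # theory says sqrt(10) ~ 3.16, plus finite-sample bias
```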
There is also a considerable upward bias associated with this estimator and very slow convergence, $\mathcal{O}(n^{-1/N})$ (cf. Peyre-Cuturi, Section 8.4), which makes me wonder how these estimators are used in practice. Below I plot the self-distance between two independent samples of a standard Gaussian as a function of sample size in different dimensions, again averaging over 20 independent draws. I am familiar with the Sinkhorn-Knopp approximation to the EMD, which solves an entropy-regularized version of the network flow problem faster, but not more accurately.
Best Answer
I solved the problem using the MOSEK solver and got similar results.
The result is around 8.3 in dimension 25, which is not surprising: for continuous random variables, the Wasserstein distance is a continuous optimization problem over all possible joint probability distributions, while for empirical distributions we in fact optimize over the matrix $T$, i.e. over couplings supported on the samples, which yields a higher optimal objective value than the continuous case.