Different versions of the entropy term in the entropy-regularized Wasserstein distance

Tags: entropy, optimal-transport, probability-distributions, probability-theory, statistics

\begin{equation}
\mathcal{W}_\epsilon(\alpha, \beta) = \min_{\pi\in \Pi(\alpha, \beta)} \int c(x,y) \,\mathrm{d}\pi(x,y) + \epsilon H(\pi \| \alpha \otimes \beta)
\end{equation}

Cuturi (2013) introduced the entropy-regularized Wasserstein distance, or Sinkhorn distance, shown above, where $\epsilon$ is the regularization parameter and $H(\pi \| \alpha \otimes \beta)$ is the relative entropy (KL divergence) between the transport plan and the product of the marginals $\alpha \otimes \beta$.
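
For concreteness, here is a minimal NumPy sketch of the Sinkhorn fixed-point iterations that solve this problem for discrete measures (the function `sinkhorn`, its parameters, and the toy data are illustrative choices of mine, not from Cuturi's paper or any particular library):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=500):
    """Approximate the entropy-regularized OT plan between discrete
    marginals a (shape n) and b (shape m) with cost matrix C (n x m)."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                  # enforce row marginals
        v = b / (K.T @ u)                # enforce column marginals
    return u[:, None] * K * v[None, :]   # transport plan pi

# Toy example: two random discrete measures on a 1-D grid
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
a = rng.random(50); a /= a.sum()
b = rng.random(50); b /= b.sum()
C = (x[:, None] - x[None, :]) ** 2       # squared-distance cost
pi = sinkhorn(a, b, C)
print((pi * C).sum())                    # regularized transport cost
```

As $\epsilon \to 0$ the regularized cost approaches the unregularized Wasserstein cost, while larger $\epsilon$ makes the iterations converge faster at the price of a more diffuse plan.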

But I have seen the $H(\cdot)$ term shown in two different ways, one with entropy and the other with relative entropy:

\begin{align}
H(\pi) &= \int \pi(x,y) \ln \pi(x,y) \,\mathrm{d}x \,\mathrm{d}y \\
H(\pi \| \alpha \otimes \beta) &= \int \ln \left(\frac{\mathrm{d}\pi (x,y)}{\mathrm{d}\alpha(x) \mathrm{d}\beta(y) } \right) \mathrm{d}\pi (x,y)
\end{align}

How are these last two expressions connected? They're obviously not identical, so why are there two different versions in circulation?

Best Answer

These two are actually equivalent up to an additive constant when $\pi$ is a coupling of $\alpha$ and $\beta$. I'll assume that $\pi$, $\alpha$, and $\beta$ all have densities. We can then write:

$$ H(\pi \| \alpha \otimes \beta) = \int \ln \left( \frac{\mathrm{d}\pi(x,y)}{\mathrm{d}\alpha(x)\,\mathrm{d}\beta(y)} \right) \mathrm{d}\pi(x,y) = \int \pi(x,y) \ln \left( \frac{\pi(x,y)}{\alpha(x)\,\beta(y)} \right) \mathrm{d}x \,\mathrm{d}y $$

Note that $\pi(x,y)$ is the density with respect to the Lebesgue measure, and the same can be said for $\alpha(x)$ and $\beta(y)$. Therefore:

\begin{align}
H(\pi \| \alpha \otimes \beta) &= \int \pi(x,y) \ln \pi(x,y) \,\mathrm{d}x \,\mathrm{d}y - \int \pi(x,y) \ln \alpha(x) \,\mathrm{d}x \,\mathrm{d}y - \int \pi(x,y) \ln \beta(y) \,\mathrm{d}x \,\mathrm{d}y \\
&= \int \pi(x,y) \ln \pi(x,y) \,\mathrm{d}x \,\mathrm{d}y - \int \alpha(x) \ln \alpha(x) \,\mathrm{d}x - \int \beta(y) \ln \beta(y) \,\mathrm{d}y \\
&= H(\pi) - H(\alpha) - H(\beta),
\end{align}

where the second equality uses the marginal constraints of the coupling, $\int \pi(x,y) \,\mathrm{d}y = \alpha(x)$ and $\int \pi(x,y) \,\mathrm{d}x = \beta(y)$.

Since $\alpha$ and $\beta$ are fixed in the minimization over $\pi$, the terms $-H(\alpha) - H(\beta)$ are constant, so $H(\pi \| \alpha \otimes \beta) = H(\pi) + C$ for a constant $C$. The two regularizers therefore differ only by a constant and yield the same optimal plan.
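
As a quick sanity check, one can verify the identity numerically for a discrete coupling (a sketch under my own conventions; `neg_entropy` is a hypothetical helper, and the sign convention $H(p) = \int p \ln p$ follows the question):

```python
import numpy as np

rng = np.random.default_rng(1)
pi = rng.random((4, 5)); pi /= pi.sum()  # an arbitrary joint distribution
a = pi.sum(axis=1)                       # its row marginal (alpha)
b = pi.sum(axis=0)                       # its column marginal (beta)

def neg_entropy(p):
    """Negative entropy, matching the question's convention H(p) = sum p log p."""
    return np.sum(p * np.log(p))

kl = np.sum(pi * np.log(pi / np.outer(a, b)))        # H(pi || alpha x beta)
print(np.isclose(kl, neg_entropy(pi) - neg_entropy(a) - neg_entropy(b)))  # True
```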
