Intuition behind entropy-constrained transportation problem


The intuition behind an optimal transportation problem is rather clear. One aims at finding the best transportation plan $\gamma^{\star}(x,y)$ to match two distributions $p_s$ and $p_t$. The cost of the optimal plan gives the Wasserstein distance:
$$\text{WD}(p_s, p_t) = \inf_{\gamma \in \Pi} \sum_{x,y} C(x,y)\,\gamma(x,y)$$
where $\Pi$ denotes the set of all joint distributions whose marginals are $p_s$ and $p_t$, and $C$ denotes a cost function.
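
For small discrete problems this linear program can be solved directly. Here is a minimal sketch with SciPy's `linprog` (the helper name `exact_ot` and its setup are my own, for illustration):

```python
import numpy as np
from scipy.optimize import linprog

def exact_ot(C, p_s, p_t):
    """Minimize <C, gamma> over couplings gamma whose row sums
    are p_s and whose column sums are p_t."""
    n, m = C.shape
    # Row-sum constraints: for each x, sum_y gamma(x, y) = p_s(x)
    A_rows = np.kron(np.eye(n), np.ones(m))
    # Column-sum constraints: for each y, sum_x gamma(x, y) = p_t(y)
    A_cols = np.kron(np.ones(n), np.eye(m))
    res = linprog(
        C.ravel(),                        # vectorized cost <C, gamma>
        A_eq=np.vstack([A_rows, A_cols]),
        b_eq=np.concatenate([p_s, p_t]),
        bounds=(0, None),                 # gamma(x, y) >= 0
    )
    return res.x.reshape(n, m), res.fun   # optimal plan and its cost
```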

The same problem is considered in the literature with an entropy constraint, i.e. one considers:
$$\inf_{\gamma \in \Pi} \sum_{x,y} C(x,y)\,\gamma(x,y) - \alpha H(\gamma)$$
where $H(\gamma) = -\sum_{x,y} \gamma(x,y) \log \gamma(x,y)$ is the information entropy. I have read in the literature that this extra term makes the computation of optimal transport plans easier, but I don't see the intuition behind it. What does the entropy of a transportation plan represent? How does adding this term affect the optimal plan? Can someone explain it in the language of information theory, or perhaps using the famous example of matching two piles of dirt?

Best Answer

First of all, notice that for $\alpha=0$ we recover the usual optimal transport cost, which is a Wasserstein distance if our cost function $C$ is the ground metric of our space raised to some power.

As $\alpha$ gets bigger though, the solution of the problem gets further and further away from the Wasserstein distance.

Now let's look at the transport plan $\gamma$ where the infimum is attained:

  • For $\alpha=0$ the infimum is attained for our optimal plan.
  • For $\alpha\to\infty$ however, the infimum is attained when $\gamma$ has maximal entropy.

One can show that the plan (or coupling) with the highest entropy is $p_s\otimes p_t$, the independent joint distribution of $p_s$ and $p_t$. The problem can therefore be rewritten in terms of the KL divergence between $\gamma$ and $p_s\otimes p_t$:

$$\inf_{\gamma\in\Pi} \sum_{x,y} C(x,y)\,\gamma(x,y) + \alpha\, \mathrm{D}_\mathrm{KL}\left(\gamma \,\middle\|\, p_s\otimes p_t\right)$$
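
To see why the two formulations agree, expand the KL term and use the marginal constraints (a one-line check):

$$\mathrm{D}_\mathrm{KL}\left(\gamma \,\middle\|\, p_s\otimes p_t\right) = \sum_{x,y}\gamma(x,y)\log\frac{\gamma(x,y)}{p_s(x)\,p_t(y)} = -H(\gamma) + H(p_s) + H(p_t)$$

where the last equality uses that the marginals of $\gamma$ are exactly $p_s$ and $p_t$. Since $H(p_s)+H(p_t)$ does not depend on $\gamma$, the KL-regularized objective differs from the entropy-regularized one only by a constant, so both problems have the same minimizer.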

So why do we care about $\gamma$ being closer to $p_s\otimes p_t$?

We usually don't want that, but $p_s\otimes p_t$ at least already marginalizes to $p_s$ and $p_t$, i.e. it is a transport plan, albeit usually not a good one.

The Sinkhorn algorithm (also called the scaling algorithm or IPFP) iteratively enforces these marginal constraints and numerically solves the entropically regularized problem. Note, however, that the algorithm divides by $\alpha$, so $\alpha$ cannot be $0$.
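
A minimal sketch of the Sinkhorn iteration (variable names are my own; this assumes discrete marginals and a precomputed cost matrix `C`):

```python
import numpy as np

def sinkhorn(C, p_s, p_t, alpha, n_iters=500):
    """Entropically regularized OT via Sinkhorn scaling.

    C: (n, m) cost matrix; p_s: (n,), p_t: (m,) marginals;
    alpha: regularization strength, must be > 0 (we divide by it).
    """
    K = np.exp(-C / alpha)              # Gibbs kernel; alpha = 0 would divide by zero
    u = np.ones(len(p_s))
    for _ in range(n_iters):
        v = p_t / (K.T @ u)             # rescale columns to match p_t
        u = p_s / (K @ v)               # rescale rows to match p_s
    return u[:, None] * K * v[None, :]  # plan gamma = diag(u) K diag(v)
```

Each half-step enforces one of the two marginal constraints, and alternating them converges to the unique minimizer of the regularized objective.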

In terms of piles of dirt you can look at the following picture from the Computational Optimal Transport book:

[Figure: regularized couplings $\gamma$ for increasing values of the regularization $\varepsilon$, from the Computational Optimal Transport book]

As $\alpha$ (called $\varepsilon$ there) gets larger, you are less and less sure where to transport each grain of dirt; one could say the transport plan gets more chaotic.
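
One can check this numerically with the `sinkhorn` sketch above: the entropy of the returned plan grows with $\alpha$. The toy setup below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
C = (x[:, None] - x[None, :]) ** 2     # squared-distance cost on a 1-D grid
p_s = np.full(50, 1.0 / 50)            # uniform source pile
p_t = rng.random(50)
p_t /= p_t.sum()                       # random target pile

for alpha in (1e-3, 1e-2, 1e-1):
    gamma = sinkhorn(C, p_s, p_t, alpha)
    H = -np.sum(gamma * np.log(gamma + 1e-300))   # entropy of the plan
    print(f"alpha={alpha:g}  H(gamma)={H:.3f}")
```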
