Gradient of Wasserstein Distance in Otto’s Calculus


I am learning the idea of "gradient" of a functional in Otto's calculus. It is defined as follows.

Suppose the space we are working in is $(\mathcal{P}_{2,AC}(\mathbb{R}^d),W_2)$, the space of probability measures with finite second moment that are absolutely continuous w.r.t. Lebesgue measure, equipped with the 2-Wasserstein distance. From now on I identify a measure in this space with its density $\rho$. The "tangent space" at $\rho$ is defined to be $T_{\rho}\mathcal{P}_{2,AC}(\mathbb{R}^d)=\{f:\int f\,dx=0\}$. A vector field $v=\nabla\phi$ is said to be "coupled" with some $f\in T_{\rho}\mathcal{P}_{2,AC}(\mathbb{R}^d)$ if they satisfy the equation $$-\nabla\cdot(\rho\nabla\phi)=f.$$

This notion of "coupling" is consistent with the continuity equation in optimal transport. Otto's metric tensor is then defined as $$\langle f,f'\rangle_{\rho}:=\int \rho\nabla\phi\cdot\nabla\phi'\,dx,$$

where $\phi,\phi'$ are coupled with $f,f'\in T_{\rho}\mathcal{P}_{2,AC}(\mathbb{R}^d)$ respectively.

Having these in hand, we can define the gradient of a functional $F(\rho)$: $\mathop{grad} F(\rho)\in T_{\rho}\mathcal{P}_{2,AC}(\mathbb{R}^d)$ is the function such that $$\langle\mathop{grad}F(\rho),f\rangle_{\rho}=d_{\rho}F(f)=\frac{d}{dt}\Big|_{t=0}F(\rho_t),$$

for every $f\in T_{\rho}\mathcal{P}_{2,AC}(\mathbb{R}^d)$, where $\rho_t$ is any curve such that $\rho_0=\rho$ and $\partial_t\rho_t|_{t=0}=f$.

From this definition one can easily compute that, for example, when $F(\rho)=\int\rho\log\rho\,dx$ is the entropy functional, $\mathop{grad}F(\rho)=-\Delta\rho$.
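For concreteness, that computation can be sketched as follows, assuming everything is smooth and decays fast enough to integrate by parts. The notation $\frac{\delta F}{\delta\rho}$ for the first variation of $F$ is introduced here only for convenience. If $\psi$ is coupled with $f$, then

$$\frac{d}{dt}\Big|_{t=0}F(\rho_t)=\int\frac{\delta F}{\delta\rho}\,f\,dx=-\int\frac{\delta F}{\delta\rho}\,\nabla\cdot(\rho\nabla\psi)\,dx=\int\rho\,\nabla\frac{\delta F}{\delta\rho}\cdot\nabla\psi\,dx.$$

Comparing with the metric tensor, $\mathop{grad}F(\rho)$ is the tangent vector coupled with $\frac{\delta F}{\delta\rho}$, that is,

$$\mathop{grad}F(\rho)=-\nabla\cdot\Big(\rho\nabla\frac{\delta F}{\delta\rho}\Big).$$

For the entropy, $\frac{\delta F}{\delta\rho}=\log\rho+1$ and $\rho\nabla\log\rho=\nabla\rho$, which gives $\mathop{grad}F(\rho)=-\nabla\cdot(\nabla\rho)=-\Delta\rho$.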

My question now is: what is the gradient of the functional $F(\rho)=W_2^2(\rho dx,\eta dx)$ for a given $\eta dx\in \mathcal{P}_{2,AC}(\mathbb{R}^d)$? My guess is that, if $\phi_*$ is the unique Kantorovich potential (the solution to the dual Kantorovich problem between $\rho dx$ and $\eta dx$ with quadratic cost), then we should have $$\mathop{grad}W_2^2(\rho,\eta)=-\nabla\cdot(\rho\nabla\phi_*),$$

since, if we express the Wasserstein distance in terms of the dual problem, then $$W_2^2(\rho,\eta)=\max_{\phi(x)+\phi^c(y)\leq |x-y|^2}\int\phi\rho dx+\int\phi^c\eta dx.$$

If there were no maximum in the formula (that is, if $\phi$ were fixed), then the guess would clearly hold. I am wondering whether one can use the stability of optimal transport maps, or something similar, to prove that the guess is correct, but I don't know how to make that work.

Best Answer

Yes, this is true; formally, it follows from the envelope theorem. In an abstract and very smooth setting, the envelope theorem says that if an objective functional depends on a parameter $t$, $$ F(t)=\max\limits_z f(t,z), $$ then the derivative of the optimal value can be computed as $$ \frac{dF}{dt}(t)=\partial_t f(t,z_t) \qquad \mbox{for any smooth selection of a maximizer }z_t \mbox{ of }f(t,\cdot). $$ This is easy to see: for any such selection, apply the chain rule and use the optimality condition for $z_t$ in the maximization problem at fixed $t$: $$ F'(t)=\frac d{dt}f(t,z_t)=\partial_t f(t,z_t)+\underbrace{\partial_zf(t,z_t)}_{=0}\frac {dz_t}{dt}. $$ Roughly speaking, this means that one can simply forget that the maximizer varies; only the explicit dependence of the functional on the parameter matters.
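To see the envelope theorem at work numerically, here is a minimal sketch on a toy problem. The objective $f(t,z)=tz-z^4/4$ is my own choice for illustration (linear in the parameter, like the Kantorovich functional is in $\rho$), and the helper names `argmax_z`, `F` and `envelope_check` are hypothetical. It compares the derivative of the optimal value with the partial derivative in $t$ taken at a frozen maximizer.

```python
def f(t, z):
    """Toy objective (an assumption): linear in t, strictly concave in z."""
    return t * z - z**4 / 4.0


def argmax_z(t, lo=-10.0, hi=10.0, iters=200):
    """Ternary search for the maximizer of z -> f(t, z) on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(t, m1) < f(t, m2):
            lo = m1  # maximizer lies to the right of m1
        else:
            hi = m2  # maximizer lies to the left of m2
    return 0.5 * (lo + hi)


def F(t):
    """Optimal value F(t) = max_z f(t, z)."""
    return f(t, argmax_z(t))


def envelope_check(t, h=1e-4):
    """Return (dF/dt, partial_t f at the frozen maximizer) by central differences."""
    z_t = argmax_z(t)
    total = (F(t + h) - F(t - h)) / (2.0 * h)  # maximizer is re-solved at t +/- h
    partial = (f(t + h, z_t) - f(t - h, z_t)) / (2.0 * h)  # maximizer frozen at z_t
    return total, partial


if __name__ == "__main__":
    total, partial = envelope_check(2.0)
    print(total, partial)
```

For this toy objective the maximizer is $z_t=t^{1/3}$, so at $t=2$ both numbers should be close to $2^{1/3}\approx 1.26$, even though the first one re-solves the maximization at the perturbed parameters and the second one does not.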

In your specific context, you are trying to differentiate (w.r.t. $\rho$) the optimal value of the optimization problem given by the Kantorovich dual formulation $$ W_2^2(\rho,\eta) =F_\eta(\rho) =\max\limits_\phi f_\eta(\rho,\phi) =\max\limits_\phi \left\{\int \rho\phi+\int\eta\phi^c\right\} $$ (here $\eta$ is fixed once and for all; I'm mimicking my $F,f$ notation above to give some perspective, and I hope it is sufficiently self-explanatory). Although the Kantorovich potential $\phi$ from $\rho$ to $\eta$ (the optimizer) varies when $\rho$ varies, the envelope theorem strongly suggests that you can argue as if it did not vary at leading order (same for its $c$-transform $\phi^c$), and simply differentiate the functional w.r.t. the varying "parameter" $\rho$. Since the Kantorovich functional is linear in $\rho$, the conclusion is indeed that the first variation is given by $ \frac{\partial f_\eta}{\partial\rho}(\rho,\phi)=\phi$. Pairing this first variation with a tangent vector $f$ and integrating by parts against the potential coupled with $f$ then gives exactly your guess, $\mathop{grad}W_2^2(\rho,\eta)=-\nabla\cdot(\rho\nabla\phi)$. Of course, various subtle problems may arise, owing essentially to the infinite-dimensional setting and functional-analytic details, but this is the rough idea.
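As a sanity check, here is a one-dimensional Gaussian example worked by hand; the choice of $\rho$, $\eta$ and of the perturbation is mine and only meant as an illustration. Take $\rho=\mathcal N(0,1)$, $\eta=\mathcal N(a,\sigma^2)$ with $\sigma>0$, and translate $\rho$, i.e. $\rho_t=\mathcal N(t,1)$, so that $f=\partial_t\rho_t|_{t=0}=-\rho'(x)=x\rho(x)$. The optimal map is $T(x)=a+\sigma x$, and for the cost $|x-y|^2$ the Kantorovich potential satisfies $\phi'(x)=2(x-T(x))$, hence $\phi(x)=(1-\sigma)x^2-2ax$ up to an additive constant (which is irrelevant since $\int f\,dx=0$). Then

$$\int\phi f\,dx=\int\big((1-\sigma)x^3-2ax^2\big)\rho\,dx=-2a, \qquad \frac{d}{dt}\Big|_{t=0}W_2^2(\rho_t,\eta)=\frac{d}{dt}\Big|_{t=0}\Big[(t-a)^2+(1-\sigma)^2\Big]=-2a,$$

so differentiating the explicit Gaussian formula agrees with pairing the frozen potential $\phi$ against $f$.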

For a completely rigorous statement and proof I can recommend Filippo Santambrogio's book [1], in particular Chapter 7 and Proposition 7.17.

[1] Santambrogio, Filippo. "Optimal Transport for Applied Mathematicians." Birkhäuser, 2015.

