Fréchet derivative of the total variation norm for measures on a manifold

borel-measures, compact-manifolds, frechet-derivative, non-convex-optimization, total-variation

Let $\Theta$ be a compact $d$-dimensional Riemannian manifold without boundary and $M(\Theta)$ (resp. $M_+(\Theta)$) denote the set of signed (resp. nonnegative) finite Borel measures on $\Theta$.

What is the Fréchet derivative of the total variation norm given below?
$$
\| \cdot \|_{\text{TV}} \colon M(\Theta) \to \mathbb{R}_{\ge 0}, \qquad
\mu \mapsto \| \mu \|_{\text{TV}}
$$

Is it even differentiable in $\mu = 0$?
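(A one-dimensional restriction — my own remark, not something from the paper — already suggests the answer at $\mu = 0$: for any fixed $\sigma \neq 0$,
$$
\varepsilon \mapsto \| 0 + \varepsilon \sigma \|_{\text{TV}} = |\varepsilon| \, \| \sigma \|_{\text{TV}}
$$
is not differentiable at $\varepsilon = 0$, so the norm cannot be Gâteaux, let alone Fréchet, differentiable at $\mu = 0$.)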

Context: in the paper "Sparse Optimization on Measures with Over-parametrized Gradient Descent" by L. Chizat, the following setting (A1) is considered:
Let $F$ be a Hilbert space and let $\phi \colon \Theta \to F$ and $R \colon F \to \mathbb{R}$ each be twice Fréchet differentiable with locally Lipschitz second-order derivatives, such that $\nabla R$ is bounded on sublevel sets.
(Does this mean on the sublevel sets of $R$?)

Chizat claims on page 5 that

the objective
\begin{equation*}
J \colon M_+(\Theta) \to \mathbb R, \qquad
\nu \mapsto R\left(\int_{\Theta} \phi(\theta) \text{d}\nu(\theta)\right) + \lambda \| \nu \|_{\text{TV}},
\end{equation*}

which can easily be extended to $M(\Theta)$ (see Appendix A in that paper, which is also available on arXiv), is Fréchet differentiable and its differential at $\nu \in M(\Theta)$ can be represented by
$$
J^{'}_{\nu} \colon \Theta \to \mathbb{R}, \qquad
\theta \mapsto \left\langle \phi(\theta), \nabla R\left(\int_{\Theta} \phi(\theta') \,\text{d}\nu(\theta')\right) \right\rangle_{F} + \lambda
$$

in the sense that $\frac{d}{d \varepsilon} J(\nu + \varepsilon \sigma) \bigg|_{\varepsilon = 0} = \int_{\Theta} J_{\nu}^{'}(\theta) \text{d}\sigma(\theta)$.
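(For the first summand of $J^{'}_{\nu}$ this is just the chain rule; the following unrolling is my own, assuming the vector-valued integral $\int_{\Theta} \phi \,\text{d}\sigma$ is well defined in $F$:
$$
\frac{d}{d \varepsilon} R\left(\int_{\Theta} \phi \,\text{d}(\nu + \varepsilon \sigma)\right) \bigg|_{\varepsilon = 0}
= \left\langle \nabla R\left(\int_{\Theta} \phi \,\text{d}\nu\right), \int_{\Theta} \phi \,\text{d}\sigma \right\rangle_{F}
= \int_{\Theta} \left\langle \phi(\theta), \nabla R\left(\int_{\Theta} \phi \,\text{d}\nu\right) \right\rangle_{F} \text{d}\sigma(\theta),
$$
which accounts for the first summand. The issue below is only with the second summand, $\lambda$.)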

Using the linearity of the Fréchet derivative (so the two terms of $J$ can be differentiated separately) and focusing only on the second term (the one with $\lambda$), this would imply that $D \| \cdot \|_{\text{TV}}(\nu)[\sigma] = \| \sigma \|_{\text{TV}}$, where $D f(x)[h] \in Y$ denotes the Fréchet derivative of $f \colon X \to Y$ at $x \in X$ in direction $h \in X$.

We have that if $f$ is linear, then $D f(x)[h] = f(h)$ for all $x, h \in X$.
Does the converse also hold? If yes, this would imply that the total variation norm is linear, which is surely not true.
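(A short observation of my own: the converse does hold, because a Fréchet derivative is by definition a bounded linear map. If
$$
D f(x)[h] = f(h) \quad \text{for all } x, h \in X,
$$
then $f$ coincides with the linear map $D f(x)$ and is therefore itself linear, so the representation above cannot literally give the total variation norm in the second term.)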

Best Answer

What you've written here is inconsistent with what I remembered from the paper, so I followed both links (Springer & arXiv). Your definition of $J$ is not the one in the paper.

First, $J$ on page 5 of the arXiv version (page 6 of the Springer version) is defined on $M(\Theta)$, not $M_+(\Theta)$; $M_+(\Theta)$ would be wrong here, since it is not a vector space, whereas $M(\Theta)$ is. It is only the optimization that is carried out over $M_+(\Theta)$.

Second, the last term in the extended $J$ is the total mass $\nu(\Theta)$, not $\|\nu\|_{\text{TV}}$; the norm on a Banach space is in general not Fréchet differentiable. (On $M_+(\Theta)$ the two coincide, since $\nu(\Theta) = \|\nu\|_{\text{TV}}$ for nonnegative $\nu$, but the extension to $M(\Theta)$ uses the total mass.) So what is being differentiated with respect to $\nu$ is $\nu \mapsto \nu(\Theta)$, not the total variation $\nu \mapsto \|\nu\|_{\text{TV}}$, and its derivative is $1$: for $J(\nu) = \nu(\Theta) = \int_{\Theta} \text{d}\nu(\theta)$, $J$ is linear on $M(\Theta)$, so its derivative is itself, $dJ_{\nu}(\sigma) = \sigma(\Theta) = \int_{\Theta} 1 \,\text{d}\sigma(\theta)$. Comparing this with the definition of $J^{'}_{\nu}$, you get $J^{'}_{\nu}(\theta) = 1$.
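(A concrete example of my own, not from the answer or the paper, of why the linear extension uses the total mass rather than the norm: take $\nu = \delta_{\theta_1} - \delta_{\theta_2}$ for distinct $\theta_1, \theta_2 \in \Theta$; then
$$
\nu(\Theta) = 1 - 1 = 0
\qquad\text{while}\qquad
\|\nu\|_{\text{TV}} = |\nu|(\Theta) = 2,
$$
so the two functionals agree on $M_+(\Theta)$ but differ on $M(\Theta)$, and only $\nu \mapsto \nu(\Theta)$ is linear.)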
