The Kullback-Leibler divergence $D_{\rm KL}(Q||P)$ of two distributions $Q,P$ has been generalized to multiple distributions in various ways:
[1] information radius: $R(P_1,\ldots,P_k)=\frac{1}{k}\sum_{i=1}^k D_{\rm KL}\big(P_i\,\big\|\,\frac{1}{k}\sum_{j=1}^k P_j\big)$
[2] average divergence: $K(P_1,\ldots,P_k)=\frac{1}{k(k-1)}\sum_{i\ne j} D_{\rm KL}(P_i\|P_j)$
[3,4] dissimilarity: the weighted arithmetic mean of the KL distances between each of the $P_i$’s and the barycenter of all the $P_i$’s
References
[1] Robin Sibson, Information radius. Probability Theory and Related Fields, 14, 149–160 (1969).
[2] Andrea Sgarro, Informational divergence and the dissimilarity of probability distributions. Calcolo, 18, 293–302 (1981).
[3] Michèle Basseville, Divergence measures for statistical data processing (2010).
[4] Darío García-García and Robert C. Williamson, Divergences and Risks for Multiclass Experiments (2012).
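For concreteness, here is a minimal numerical sketch of the three generalizations above, for distributions given as arrays on a common finite support. This is my own illustration, not taken from the references; the function names are just labels, and I assume all the arrays are strictly positive so every KL term is finite.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions (assumes q > 0 wherever p > 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def dissimilarity(dists, weights=None):
    """Weighted mean of the KL distances to the weighted barycenter [3, 4];
    with uniform weights this reduces to Sibson's information radius [1]."""
    k = len(dists)
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights)
    bary = sum(wi * pi for wi, pi in zip(w, dists))  # barycenter of the P_i's
    return float(sum(wi * kl(pi, bary) for wi, pi in zip(w, dists)))

def average_divergence(dists):
    """Average pairwise divergence K(P_1, ..., P_k) of [2]."""
    k = len(dists)
    return sum(kl(p, q) for i, p in enumerate(dists)
               for j, q in enumerate(dists) if i != j) / (k * (k - 1))

P = [np.array([0.5, 0.5]), np.array([0.9, 0.1]), np.array([0.2, 0.8])]
print(dissimilarity(P), average_divergence(P))
```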
Assuming you want to minimize the Kullback–Leibler divergence
$$D(P\parallel Q)=\int dP\,\ln\frac{dP}{dQ}$$
over all isotropic Gaussians $P$, "the" minimizer is in general not unique; accordingly, the map taking $Q$ to a minimizer is not Lipschitz even on the set of measures $Q$ where the minimizer is unique.
The idea of a counterexample is quite simple: Suppose that $d=1$. Let $Q_h$ be the probability measure with pdf $q_h$ given by the formula
$$q_h(x)=c_h\big((1+h)\,f_{1,a}(x)\,1(x>0)+(1-h)\,f_{-1,a}(x)\,1(x<0)\big)$$
for real $x$, where $f_{t,a}$ is the pdf of the normal distribution $N(t,a^2)$, $a>0$ is small enough (the condition $0<a<\sqrt{2/\pi}$ should do), $h$ is a real number very close to $0$, and $c_h(\approx1/2)$ is the normalizing factor.
Since $a$ is rather small, $Q_h$ is somewhat close to the mixture of the rather narrow normal distributions $N(1,a^2)$ and $N(-1,a^2)$ with slightly unequal weights, $c_h\,(1+h)$ and $c_h\,(1-h)$ respectively. So, a minimizer $P_h$ of the Kullback–Leibler divergence $D(P\parallel Q_h)$ in $P$ should be sufficiently close to $N(1,a^2)$ or $N(-1,a^2)$ depending on whether the small perturbation $h$ is $>0$ or $<0$, respectively. Thus, an infinitesimally small change from, say, $h>0$ to $-h<0$ will result in quite a nonnegligible change from $P_h\approx N(1,a^2)$ to $P_{-h}\approx N(-1,a^2)$. (If $h=0$, then there will be two minimizers.)
I can write down details later, if you want them.
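In the meantime, here is a minimal numerical sketch of the counterexample (my own illustration, not part of the original argument): it fits a one-dimensional Gaussian $N(m,s^2)$ to $Q_h$ by numerically minimizing $D(N(m,s^2)\parallel Q_h)$, with the particular choice $a=0.3$, which lies inside the suggested range $0<a<\sqrt{2/\pi}$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize
from scipy.stats import norm

a = 0.3                        # narrow component width; chosen with 0 < a < sqrt(2/pi)
c = 1 / (2 * norm.cdf(1 / a))  # normalizing factor c_h (here it does not depend on h)

def log_q(x, h):
    """log q_h(x): the N(1, a^2) branch for x > 0, the N(-1, a^2) branch for x < 0."""
    t = 1.0 if x > 0 else -1.0
    w = 1 + h if x > 0 else 1 - h
    return np.log(c * w) + norm.logpdf(x, t, a)

def kl(params, h):
    """D(N(m, s^2) || Q_h) by numerical integration."""
    m, log_s = params
    s = np.exp(log_s)  # parametrize s > 0 via its logarithm
    f = lambda x: norm.pdf(x, m, s) * (norm.logpdf(x, m, s) - log_q(x, h))
    return quad(f, -8, 8, points=[0.0], limit=200)[0]

for h in (0.05, -0.05):
    # try starts near each narrow bump and keep the better local minimum
    best = min((minimize(kl, x0, args=(h,), method='Nelder-Mead')
                for x0 in ([1.0, np.log(a)], [-1.0, np.log(a)])),
               key=lambda res: res.fun)
    m, s = best.x[0], np.exp(best.x[1])
    print(f"h = {h:+.2f}: minimizer ~ N({m:.3f}, {s:.3f}^2)")
```

Flipping the sign of $h$ flips the fitted mean from $\approx1$ to $\approx-1$, while the fitted $s$ stays close to $a$, as described above.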
Responding to the comment of the OP about a possible relation of your question to a result by Csiszár: Your question concerns the existence and uniqueness, for any given probability measure (PM) $Q$, of a PM $P_{\mathcal S,Q}\in\mathcal S$ such that $D(P_{\mathcal S,Q}\parallel Q)\le D(P\parallel Q)$ for all $P\in\mathcal S:=\mathcal P$, where $\mathcal P$ is the set of all isotropic Gaussian PMs.
In contrast, Csiszár's result is that for any given PM $P$ (rather than $Q$) there is a unique PM $\tilde P^{\mathcal S,P}$ such that
$D(\tilde P^{\mathcal S,P}\parallel Q)+D(P\parallel\mathcal S)\le D(P\parallel Q)$ for all $Q\in\mathcal S$ (rather than $P\in\mathcal S$), where $D(P\parallel\mathcal S):=\inf_{Q\in\mathcal S}D(P\parallel Q)$. (So, this looks like some kind of Pythagorean inequality.)
A corollary to this result by Csiszár is that $D(\tilde P^{\mathcal S,P}\parallel Q)\le D(P\parallel Q)$ for all $Q\in\mathcal S$ (rather than $P\in\mathcal S$).
So, if $D$ were a metric, your question would be about the existence and uniqueness of a PM in $\mathcal S$ closest to $Q$. On the other hand, again if $D$ were a metric, the mentioned corollary of Csiszár's result would say that the segment $\tilde P^{\mathcal S,P}Q$, which is the projection of the segment $PQ$ onto $\mathcal S$, is no longer than $PQ$, for any $Q$. The latter property would be equivalent to your "closest" property if $D$ were a Euclidean metric. But $D$ is not a metric at all. So, Csiszár's result says something different from what your question is about.
(The comparison of your question to Csiszár's result got more complicated than necessary because you interchanged the usual order of the arguments $P$ and $Q$ of $D(P\parallel Q)$, which was also used by Csiszár. So, I have edited your post, and mine, accordingly.)
Best Answer
$\newcommand\de\delta\newcommand{\KL}{\operatorname{KL}}\newcommand{\p}{\,\|\,}$The maps $$\mu\mapsto\sqrt{\KL(\mu\p\nu)}$$ and $$\nu\mapsto\sqrt{\KL(\mu\p\nu)}$$ are not convex in general.
Indeed, let $\mu_p:=p\de_0+(1-p)\de_1$, where $p\in(0,1)$ and $\de_a$ is the Dirac measure supported on $\{a\}$.
Then the second partial derivative with respect to $p$ of $\sqrt{\KL(\mu_p\p\mu_r)}$ at $(p,r)=(1/10,1/11)$ is $-7.17\ldots<0$. So, $\sqrt{\KL(\mu\p\mu_r)}$ is not convex in $\mu$.
Also, the second partial derivative with respect to $r$ of $\sqrt{\KL(\mu_p\p\mu_r)}$ at $(p,r)=(1/10,1/9)$ is $-11.50\ldots<0$. So, $\sqrt{\KL(\mu_p\p\nu)}$ is not convex in $\nu$.
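These two values are straightforward to reproduce, e.g. with SymPy; the following is a verification sketch of mine, not part of the original answer.

```python
import sympy as sp

p, r = sp.symbols('p r', positive=True)
# KL(mu_p || mu_r) for mu_p = p*delta_0 + (1-p)*delta_1 on {0, 1}
KL = p * sp.log(p / r) + (1 - p) * sp.log((1 - p) / (1 - r))
F = sp.sqrt(KL)

d2_mu = sp.diff(F, p, 2).subs({p: sp.Rational(1, 10), r: sp.Rational(1, 11)})
d2_nu = sp.diff(F, r, 2).subs({p: sp.Rational(1, 10), r: sp.Rational(1, 9)})
print(sp.N(d2_mu), sp.N(d2_nu))  # approx -7.17 and -11.50, both negative
```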
You also asked: "If this is not true in general, does there exist a measure $\mu\in P(X)$ such that $\nu\mapsto \sqrt{\KL(\mu\p\nu)}$ is a convex map $P(X)\to[0,+\infty]$? Or a measure $\nu$ such that $\mu\mapsto \sqrt{\KL(\mu\p\nu)}$ is convex?"
The answer to each of these two questions is yes, at least when $X=\{0,1\}$, say.
For $p\in(0,1)$, let
\begin{equation} F(p):=\sqrt{\KL(\mu_p\p\mu_{1/2})}, \end{equation} \begin{equation} f(p):=4F''(p)\,\KL(\mu_p\p\mu_{1/2})^{3/2}, \end{equation} \begin{equation} f_1(p):=f'(p)\,(1-p)^2 p^2. \end{equation} Then $f(1/2)=0$ (indeed, $f=2KK''-(K')^2$, where $K(p):=\KL(\mu_p\p\mu_{1/2})$ vanishes together with $K'$ at $p=1/2$), $f_1(1/2)=f'_1(1/2)=f''_1(1/2)=0$, and \begin{equation} f'''_1(p)=\frac{2+4 p(1-p)}{(1-p)^2 p^2}>0. \end{equation} So, successively, $f''_1$ changes sign from $-$ to $+$ at $1/2$, $f'_1\ge0$, and $f_1$ (and hence $f'$) changes sign from $-$ to $+$ at $1/2$; together with $f(1/2)=0$ this gives $f\ge0$. It follows that $F''(p)\ge0$, so that $\sqrt{\KL(\mu\p\mu_{1/2})}$ is convex in $\mu$.
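The displayed identity for $f'''_1$ can be checked symbolically. Here is a short SymPy sketch (mine, for verification only), which uses the identity $4F''\,K^{3/2}=2KK''-(K')^2$ noted above:

```python
import sympy as sp

p = sp.symbols('p', positive=True)
K = p * sp.log(2 * p) + (1 - p) * sp.log(2 * (1 - p))  # KL(mu_p || mu_{1/2})
f = 2 * K * sp.diff(K, p, 2) - sp.diff(K, p) ** 2       # f = 4 F'' K^(3/2)
f1 = sp.diff(f, p) * (1 - p) ** 2 * p ** 2

# f1 and its first two derivatives vanish at p = 1/2 ...
print([sp.simplify(sp.diff(f1, p, n).subs(p, sp.Rational(1, 2))) for n in range(3)])
# ... and f1''' equals the displayed rational function
rhs = (2 + 4 * p * (1 - p)) / ((1 - p) ** 2 * p ** 2)
print(sp.simplify(sp.diff(f1, p, 3) - rhs))  # expected: 0
```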
For $r\in(0,1)$, let
\begin{equation} G(r):=\sqrt{\KL(\mu_{1/2}\p\mu_r)}, \end{equation} and \begin{equation} g(r):=4G''(r)\,\KL(\mu_{1/2}\p\mu_r)^{3/2}. \end{equation} Then $g(1/2)=g'(1/2)=g''(1/2)=g'''(1/2)=0$ and \begin{equation} g''''(1/2+h)\,\frac{(1 - 4 h^2)^4}{16}=9+156 h^2-16 h^4+64 h^6 \\ >9-1+156 h^2+64 h^6>0 \end{equation} if $|h|<1/2$, since then $16h^4<1$. So, successively, $g'''$ changes sign from $-$ to $+$ at $1/2$, $g''\ge0$, $g'$ changes sign from $-$ to $+$ at $1/2$, and $g\ge0$. It follows that $G''(r)\ge0$, so that $\sqrt{\KL(\mu_{1/2}\p\nu)}$ is convex in $\nu$.
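Similarly, one can confirm symbolically that $g$ and its first three derivatives vanish at $r=1/2$, and spot-check the positivity of $g''''$ on a grid. Again, this is my own verification sketch, using $4G''\,K^{3/2}=2KK''-(K')^2$ with $K(r):=\KL(\mu_{1/2}\p\mu_r)$:

```python
import numpy as np
import sympy as sp

r = sp.symbols('r', positive=True)
# KL(mu_{1/2} || mu_r) on the two-point space {0, 1}
K = sp.Rational(1, 2) * sp.log(1 / (2 * r)) + sp.Rational(1, 2) * sp.log(1 / (2 * (1 - r)))
g = 2 * K * sp.diff(K, r, 2) - sp.diff(K, r) ** 2  # g = 4 G'' K^(3/2)

# g, g', g'', g''' all vanish at r = 1/2
print([sp.simplify(sp.diff(g, r, n).subs(r, sp.Rational(1, 2))) for n in range(4)])

# positivity of g'''' on a grid in (0, 1), supporting g >= 0 and hence G'' >= 0
g4 = sp.lambdify(r, sp.diff(g, r, 4), 'numpy')
print(np.all(g4(np.linspace(0.01, 0.99, 99)) > 0))  # expected: True
```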
Remark 1: The existence of a probability measure $\nu\in P(X)$ such that $\sqrt{\KL(\mu\p\nu)}$ is convex in $\mu$ holds for any Polish space $X$. Indeed, the case when $X$ is a singleton set is trivial. Otherwise, take any distinct points $x$ and $y$ in $X$ and let $\nu:=\frac12\,\de_x+\frac12\,\de_y$. Then the condition $\mu\ll\nu$ implies $\mu=p\,\de_x+(1-p)\,\de_y$ for some $p\in[0,1]$, so that here $\sqrt{\KL(\mu\p\nu)}=F(p)$ (and $\sqrt{\KL(\mu\p\nu)}=\infty$ if $\mu\not\ll\nu$, which does not affect the convexity).
Remark 2: Let us say that a probability measure $\mu\in P(X)$ is good if $\sqrt{\KL(\mu\p\nu)}$ is convex in $\nu$. Then it is easy to see that the Dirac measure $\de_x$ is good for any $x\in X$. It also follows from the above that $\mu:=\frac12\,\de_x+\frac12\,\de_y$ is good if $X=\{x,y\}$ for some $x,y$.
Finally, using reasoning similar to that above, one can show that, in the case when the cardinality of $X$ is $\ge3$, the only good probability measures $\mu\in P(X)$ are the Dirac measures. Since this answer has already grown rather long, I will leave the latter assertion as an exercise for interested readers.
Now the answer is quite complete.