[Math] Prove that the mutual information $I(X;Y)$ is convex in the conditional distribution $p(y \mid x)$

convex-analysis, information-theory, self-learning

I'm trying to understand the proof that $I(X;Y)$ is convex in the conditional distribution $p(y \mid x)$, from Elements of Information Theory by Cover & Thomas, Theorem 2.7.4.

In the proof we fix $p(x)$ and consider two conditionals $p_1 (y \mid x)$ and $p_2 (y \mid x)$. The corresponding joints are $p_1(x, y) = p(x) \, p_1 (y \mid x)$ and $p_2(x, y) = p(x) \, p_2 (y \mid x)$, and the marginals are $p_1(y)$ and $p_2(y)$.

Then we consider the conditional $p^*(y \mid x) = \lambda \, p_1 (y \mid x) + (1 - \lambda) \, p_2 (y \mid x)$, the joint $p^*(x, y) = \lambda \, p_1 (x, y) + (1 - \lambda) \, p_2 (x, y) = \lambda \, p(x) \, p_1 (y \mid x) + (1 - \lambda) \, p(x) \, p_2 (y \mid x)$, and the marginal $p^*(y) = \lambda \, p_1 (y) + (1 - \lambda) \, p_2 (y)$.

If we let $q^*(x, y) = p(x) \, p^*(y) = \lambda \, p(x) \, p_1 (y) + (1 - \lambda) \, p(x) \, p_2 (y)$, then the KL divergence between $p^*(x, y)$ and $q^*(x, y)$ is $D \big( p^*(x, y) \ || \ q^*(x, y) \big) = D \big( p^*(x, y) \ || \ p(x) \, p^*(y) \big)$, which equals $I(X; Y)$ for $(X, Y) \sim p^*(x, y)$.
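Written out with the definition of the KL divergence, this is just the usual expression for mutual information evaluated at the joint $p^*$ (note that $\sum_y p^*(x, y) = p(x)$, so $p(x)$ and $p^*(y)$ are exactly the marginals of $p^*$):

$$D \big( p^*(x, y) \ || \ p(x) \, p^*(y) \big) = \sum_{x, y} p^*(x, y) \log \frac{p^*(x, y)}{p(x) \, p^*(y)} = I(X; Y).$$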

Next, the book concludes the proof by saying that since $D( p \ || \ q)$ is convex in the pair $(p, q)$, so is $I(X;Y)$. What I don't understand is why this implies that $I(X; Y)$ is convex in $p(y \mid x)$.

Best Answer

First, let's state the claim.

Claim: $I(X;Y)=D(p(x,y)||p(x)p(y))$ is concave in $p(x)$ for fixed $p(y|x)$ and convex in $p(y|x)$ for fixed $p(x)$.

One can write the following:

$I(X;Y)=H(Y)-H(Y|X)$, where $H$ is the entropy function. The two entropy terms can be written as

$$H(Y)=-\sum_y p(y)\log(p(y)) \ \text{ and } \ H(Y|X)=-\sum_x p(x)\sum_y p(y|x)\log (p(y|x))$$

What we know is that $H(Y)$ is a concave function of $p(y)$, because $x\log x$ is convex in $x$, so $-x\log x$ is concave. We can write $p(y)=\sum_x p(y|x)p(x)$. Hence, whenever $p(y|x)$ is fixed (held constant), $p(y)$ is a linear function of $p(x)$, and therefore $H(Y)$ is also concave in $p(x)$. For fixed $p(y|x)$, $H(Y|X)$ is linear in $p(x)$. Now consider a function that is the sum of a concave and a linear function, $f(x)=c(x)+l(x)$. Since $f^{''}(x)=c^{''}(x)$, where $(\cdot)^{''}$ denotes the second derivative, $f$ is also concave. Therefore, $I(X;Y)=H(Y)-H(Y|X)$ is a concave function of $p(x)$.
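To see the linearity of $H(Y|X)$ in $p(x)$ explicitly, write

$$H(Y \mid X) = \sum_x p(x) \, H(Y \mid X = x), \qquad H(Y \mid X = x) = -\sum_y p(y \mid x) \log p(y \mid x);$$

for fixed $p(y \mid x)$ each $H(Y \mid X = x)$ is just a constant, so $H(Y \mid X)$ is a linear function of $p(x)$.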

Now let's come to the second part. When $p(x)$ is held constant, $p(y)=\sum_x p(y|x)p(x)$ is a linear function of $p(y|x)$, so $H(Y)$ is a concave function of $p(y|x)$. For the same underlying reason (concavity of $-t\log t$), $H(Y|X)$ is also concave in $p(y|x)$. The difference of two concave functions can be concave, convex, neither, or both (if it is affine), so this decomposition alone does not settle the question. Going back to the definition $I(X;Y)=D(p(x,y)||p(x)p(y))$, we know that the KL divergence is jointly convex in its two arguments. With $p(x)$ fixed, both arguments are affine in $p(y|x)$: $p(x,y)=p(x)p(y|x)$ is linear in $p(y|x)$, and so is $p(x)p(y)$, since $p(y)$ is linear in $p(y|x)$. A jointly convex function composed with affine maps remains convex, and therefore $I(X;Y)$ is convex in $p(y|x)$.
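To make this last step explicit with the notation from the question: when $p(x)$ is fixed, both arguments of the divergence are affine in the conditional,

$$p^*(x, y) = p(x) \, p^*(y \mid x) = \lambda \, p_1(x, y) + (1 - \lambda) \, p_2(x, y), \qquad p(x) \, p^*(y) = \lambda \, p(x) \, p_1(y) + (1 - \lambda) \, p(x) \, p_2(y),$$

so the joint convexity of the KL divergence gives

$$D \big( p^*(x, y) \ || \ p(x) \, p^*(y) \big) \le \lambda \, D \big( p_1(x, y) \ || \ p(x) \, p_1(y) \big) + (1 - \lambda) \, D \big( p_2(x, y) \ || \ p(x) \, p_2(y) \big),$$

i.e. $I_{p^*}(X;Y) \le \lambda \, I_{p_1}(X;Y) + (1 - \lambda) \, I_{p_2}(X;Y)$, which is exactly convexity of $I(X;Y)$ in $p(y \mid x)$.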

Another idea would be to show that $I(X;Y)$ attains no maximum over $p(y|x)$ except on the boundary. I don't know, however, how this could be shown. What I do know is that $I(X;Y)$ is always upper bounded by $\min\{H(X),H(Y)\}$, simply from the symmetry $I(X;Y)=H(Y)-H(Y|X)=H(X)-H(X|Y)$.
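As a quick numerical sanity check (not a proof), one can fix a $p(x)$, sample two conditionals $p_1(y \mid x)$ and $p_2(y \mid x)$, and verify the convexity inequality along the segment between them. The sketch below assumes NumPy and arbitrary small alphabets; the helper `mutual_information` is just illustrative.

```python
import numpy as np

def mutual_information(px, py_given_x):
    """I(X;Y) in nats for X ~ px and conditional py_given_x[x, y]."""
    pxy = px[:, None] * py_given_x              # joint p(x, y)
    py = pxy.sum(axis=0)                        # marginal p(y)
    ratio = pxy / (px[:, None] * py[None, :])   # p(x, y) / (p(x) p(y))
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(ratio[mask])))

rng = np.random.default_rng(0)
px = rng.dirichlet(np.ones(3))                  # fixed p(x) on a 3-letter alphabet
p1 = rng.dirichlet(np.ones(4), size=3)          # p1(y|x): each row is a distribution over y
p2 = rng.dirichlet(np.ones(4), size=3)          # p2(y|x)

for lam in np.linspace(0.0, 1.0, 11):
    p_star = lam * p1 + (1 - lam) * p2          # mixed conditional p*(y|x)
    lhs = mutual_information(px, p_star)
    rhs = lam * mutual_information(px, p1) + (1 - lam) * mutual_information(px, p2)
    assert lhs <= rhs + 1e-12                   # convexity: I(p*) <= lam*I(p1) + (1-lam)*I(p2)
print("convexity inequality holds at all sampled lambdas")
```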
