[Math] Alternative proof that conditioning reduces entropy

convex-analysis, information-theory, statistics

I was going through some class notes which give a different proof that conditioning reduces entropy than the usual one, which relies on the fact that mutual information is non-negative. In this proof, I am confused by the second step, which involves an inequality:
\begin{align}
H(X|Y) &= \sum_x \sum_y p_y \left(- p_{x|y} \log p_{x|y}\right) \\
&\le \sum_x \left\{\left(-\sum_y p_y \, p_{x|y}\right)\left(\log \sum_y p_y \, p_{x|y} \right)\right\} \\
&= \sum_x - p_x \log p_x \\
&= H(X).
\end{align}
Basically, I'm not sure how the second set of parentheses in the inequality step ends up being the log of a sum containing $p_y$.

My only thoughts are that maybe Jensen's inequality is being used here, or the Cauchy-Schwarz inequality, or some combination thereof, but I am not sure. Or is this just sloppy note-keeping?

Thanks.

Best Answer

You can proceed using the so-called log-sum inequality (see https://en.wikipedia.org/wiki/Log_sum_inequality), which states the following:

$\sum_{i}a_i\log{\frac{a_i}{b_i}}\geq (\sum_ia_i)(\log{\frac{\sum_ia_i}{\sum_ib_i}})$
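As a quick sanity check with made-up numbers (not part of the original answer): take $a=(1,2)$ and $b=(2,1)$. Then the left-hand side is $1\log\frac{1}{2}+2\log\frac{2}{1}=\log 2\approx 0.69$, while the right-hand side is $(1+2)\log\frac{3}{3}=0$, consistent with the inequality.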

Then for the case you are presenting, fix $x$ and let the index $i$ run over $y$: select $a_y=p_y\,p_{x|y}$ and $b_y=p_y$. The step you are referring to then follows as

\begin{align}
\sum_x\sum_y -p_y\,p_{x|y}\log p_{x|y}
&= \sum_x\sum_y -p_y\,p_{x|y}\log\frac{p_y\,p_{x|y}}{p_y} \\
&\le \sum_x\left\{\left(-\sum_y p_y\,p_{x|y}\right)\left(\log\frac{\sum_y p_y\,p_{x|y}}{\sum_y p_y}\right)\right\} \\
&= \sum_x\left\{\left(-\sum_y p_y\,p_{x|y}\right)\left(\log\sum_y p_y\,p_{x|y}\right)\right\},
\end{align}

where the last equality uses $\sum_y p_y = 1$.

Note that the inequality reverses direction when both sides are multiplied by $-1$, as is the case here, so the form of the log-sum inequality actually being used is

$-\sum_{i}a_i\log{\frac{a_i}{b_i}}\leq -(\sum_ia_i)(\log{\frac{\sum_ia_i}{\sum_ib_i}})$.
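Regarding the Jensen's inequality guess in the question: that route works too (sketched here as an equivalent alternative, not part of the original notes). Since $f(t)=-t\log t$ is concave, Jensen's inequality gives, for each fixed $x$,

$\sum_y p_y\,f\!\left(p_{x|y}\right)\;\le\; f\!\left(\sum_y p_y\,p_{x|y}\right)=-p_x\log p_x,$

and summing over $x$ yields $H(X|Y)\le H(X)$ directly.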
