Probability – Proof of Sub-Additivity for Shannon Entropy

In "A Mathematical Theory of Communication" (C. E. Shannon, 1948), the entropy of a categorical random variable is defined as:
$$H(X)=-\sum_{i}P(X=i)\log P(X=i)$$
while the joint entropy of two such variables is defined as:
$$H(X,Y)=-\sum_{i,j}P(X=i,Y=j)\log P(X=i,Y=j)$$
Then it is stated (p.12) that "It is easily shown that":
$$H(X,Y)\leq H(X)+H(Y)$$
with equality for independence.
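(As a sanity check rather than a proof, the inequality is easy to verify numerically. Here is a small Python sketch with a made-up joint pmf and base-2 logs; the base doesn't affect the inequality.)

```python
import numpy as np

# Made-up joint pmf P(X=i, Y=j); any nonnegative table summing to 1 will do.
pxy = np.array([[0.30, 0.10],
                [0.05, 0.55]])

def entropy(p):
    """Shannon entropy -sum p log p (base 2), skipping zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

px = pxy.sum(axis=1)   # marginal distribution of X
py = pxy.sum(axis=0)   # marginal distribution of Y

H_X, H_Y, H_XY = entropy(px), entropy(py), entropy(pxy)
print(H_XY, H_X + H_Y)              # approx. 1.544 vs 1.905 for this table
assert H_XY <= H_X + H_Y + 1e-12    # H(X,Y) <= H(X) + H(Y)
```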

I believe this property is referred to as sub-additivity, and I'm wondering what this "easy" way to prove it might be.

What I have tried so far:

I believe, using the Law of Total Probability, we can get
$H(X)+H(Y)=-\sum_{i,j}P(X=i,Y=j)\log (P(X=i)P(Y=j))$, which would establish the equality in the case of independence, but I don't know how to get the inequality. Term by term, that would seem to require $P(X=i,Y=j)\leq P(X=i)P(Y=j)$, which is not always true.
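For completeness, here is that step written out, using only the marginalizations $P(X=i)=\sum_j P(X=i,Y=j)$ and $P(Y=j)=\sum_i P(X=i,Y=j)$:
\begin{align} H(X)+H(Y) &= -\sum_{i}P(X=i)\log P(X=i)-\sum_{j}P(Y=j)\log P(Y=j)\\ &= -\sum_{i,j}P(X=i,Y=j)\log P(X=i)-\sum_{i,j}P(X=i,Y=j)\log P(Y=j)\\ &= -\sum_{i,j}P(X=i,Y=j)\log\big(P(X=i)P(Y=j)\big). \end{align}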

I've found a source that proves the inequality using two other properties (the Chain Rule and Dropping the Conditional), but those are introduced later in Shannon's paper, so they can't be the "easy" proof he had in mind, can they?

Best Answer

Define the relative entropy (or Kullback-Leibler divergence) of the probability distributions $P$ and $Q$, with $P\ll Q$, as $$D(P\|Q) = \sum_{x\in \Omega} p(x) \log \frac{p(x)}{q(x)}. $$ Using the inequality $\log a \leq a-1$ for $a>0$, it is easy to show that $$D(P\|Q) \geq 0$$ (just take $a=\frac{q(x)}{p(x)}$ for each $x$ with $p(x)>0$).
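In detail, with $a=\frac{q(x)}{p(x)}$ for each $x$ with $p(x)>0$, the inequality $\log a \leq a-1$ gives
\begin{align} -D(P\|Q) &= \sum_{x:\,p(x)>0} p(x)\log\frac{q(x)}{p(x)} \\ &\leq \sum_{x:\,p(x)>0} p(x)\left(\frac{q(x)}{p(x)}-1\right) \\ &= \sum_{x:\,p(x)>0} q(x) - \sum_{x:\,p(x)>0} p(x) \\ &\leq 1 - 1 = 0, \end{align}
since $Q$ is a probability distribution, so $\sum_{x:\,p(x)>0} q(x)\leq 1$.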

But then \begin{align} H(X)+H(Y) - H(X,Y) &= \sum_{x,y\in \Omega} p(x,y)\log \frac{p(x,y)}{p_X(x)p_Y(y)} \\ &=D(P_{X,Y}\| P_X \otimes P_Y)\\ & \geq 0, \end{align} where $P_X$ and $P_Y$ are the marginal distributions of $X$ and $Y$, and equality holds if and only if $X$ and $Y$ are independent.
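If a numerical sanity check helps, here is a small Python sketch (made-up joint pmf, base-2 logs, variable names of my own choosing) verifying that $D(P_{X,Y}\|P_X\otimes P_Y)$ coincides with $H(X)+H(Y)-H(X,Y)$ and is nonnegative:

```python
import numpy as np

# Arbitrary made-up joint pmf for (X, Y); replace with any table summing to 1.
pxy = np.array([[0.25, 0.15],
                [0.20, 0.40]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)   # marginals of X and Y

def entropy(p):
    """Shannon entropy -sum p log p (base 2), skipping zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# D(P_XY || P_X (x) P_Y), summing only over cells with p(x,y) > 0
prod = np.outer(px, py)
mask = pxy > 0
kl = np.sum(pxy[mask] * np.log2(pxy[mask] / prod[mask]))

gap = entropy(px) + entropy(py) - entropy(pxy)   # H(X) + H(Y) - H(X,Y)
print(kl, gap)                                   # identical, approx. 0.060 here
assert np.isclose(kl, gap) and kl >= -1e-12
```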