Understanding mutual information derivation

information-theory, statistics

Mutual information is the KL divergence between the joint distribution and the product of the marginals, and the proof I'm reading expands it like this:

$$
\begin{align*}
I(X;Y) &= D\big(p(x,y)\,\|\,p(x)p(y)\big)\\
&= \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}\\
&= \sum_{x,y} p(x,y)\log p(x,y) - \sum_{x,y} p(x,y)\log p(x) - \sum_{x,y} p(x,y)\log p(y)
\end{align*}
$$

Then the proof rewrites each of these sums as an entropy term:

$$
\begin{align*}
\sum_{x,y} p(x,y)\log p(x) &= -H(X)\\
\sum_{x,y} p(x,y)\log p(y) &= -H(Y)
\end{align*}
$$

I don't understand how that works for these two sums. For $H(X)$, the distribution inside the log, $p(x)$, doesn't match the distribution $p(x,y)$ that weights it, so isn't the sum taken over a different distribution?

Best Answer

Observe that by the law of total probability $\sum_y p(x,y)=p(x)$, hence
$$
\begin{align*}
\sum_{x,y} p(x,y) \log p(x) &= \sum_{x} \Big(\sum_y p(x,y)\Big) \log p(x)\\
&= \sum_{x} p(x) \log p(x)\\
&= -H(X).
\end{align*}
$$
The same argument applies to the other sum, giving $-H(Y)$.
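
To close the loop (this substitution is my own addition, using the standard identity $\sum_{x,y} p(x,y)\log p(x,y) = -H(X,Y)$), plugging the two results back into the expansion in the question gives
$$
I(X;Y) = -H(X,Y) + H(X) + H(Y).
$$

If it helps, here is a quick numerical sanity check (my own toy example, not part of the original answer) that the marginalization step does what is claimed, assuming an arbitrary $2\times 3$ joint distribution:

```python
import numpy as np

# Toy joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])
assert np.isclose(p_xy.sum(), 1.0)

# Marginal p(x) via the law of total probability: sum over y.
p_x = p_xy.sum(axis=1)

# Left side: sum over (x, y) of p(x, y) * log p(x).
lhs = np.sum(p_xy * np.log(p_x)[:, None])

# Right side: sum over x of p(x) * log p(x), i.e. -H(X).
rhs = np.sum(p_x * np.log(p_x))

print(lhs, rhs)          # both print the same value, namely -H(X)
assert np.isclose(lhs, rhs)
```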