Information Theory – Expressing Conditional Entropy with Kullback-Leibler Divergence

entropy, information theory

I'm going through "Elements of Information Theory" by Cover and Thomas, where the conditional entropy is shown to equal:
\begin{align}H(Y\mid X)=-\sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}} p(x,y) \log_2p(y \mid x).\end{align}

Another well-known formula is the one for mutual information:
\begin{align}I(X;Y)&=\sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}} p(x,y)\log_2\frac{p(x,y)}{p(x)p(y)} \\
&=E_{p(x,y)}\log_2\frac{p(X,Y)}{p(X)p(Y)} \\
&=D_{KL}(p(x,y)\parallel p(x)p(y)).\end{align}

Following the same reasoning, I thought we could also write (since $p(y\mid x)=p(x,y)/p(x)$):
\begin{align}H(Y\mid X)&=-\sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}} p(x,y) \log_2 \frac{p(x,y)}{p(x)}\\
&=-E_{p(x,y)}\log_2\frac{p(X,Y)}{p(X)}\\
&=-D_{KL}(p(x,y)\parallel p(x)).\end{align}

Is my reasoning correct?

I can't seem to find that equality listed anywhere. Wikipedia lists a couple of other identities relating Kullback-Leibler divergence and conditional entropy, but makes no mention of this one, so I suspect I am mistaken somewhere.

Best Answer

The last equality in your derivation is not correct; the first two lines are fine, since they simply rewrite the definition of $H(Y\mid X)$ as an expectation over $p(x,y)$. Note that the KL divergence $D_\text{KL}(p\|q)$ is only meaningful when the two distributions involved, $p$ and $q$, are defined over the same space. A quantity such as $D_\text{KL}(p(x,y)\|p(x))$ makes no sense, as it involves the distribution $p(x,y)$, which is defined over $\mathcal{X}\times \mathcal{Y}$, and the distribution $p(x)$, which is defined over $\mathcal{X}$.
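If you want a genuine identity of this type, here is one sketch (assuming $\mathcal{Y}$ is finite, and writing $u(y)=1/|\mathcal{Y}|$ for the uniform distribution on $\mathcal{Y}$, so that $p(x)u(y)$ is a proper distribution on $\mathcal{X}\times\mathcal{Y}$):
\begin{align}D_\text{KL}\bigl(p(x,y)\,\|\,p(x)u(y)\bigr)&=\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p(x,y)\log_2\frac{p(x,y)}{p(x)/|\mathcal{Y}|}\\
&=\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} p(x,y)\log_2 p(y\mid x)+\log_2|\mathcal{Y}|\\
&=\log_2|\mathcal{Y}|-H(Y\mid X),\end{align}
so that $H(Y\mid X)=\log_2|\mathcal{Y}|-D_\text{KL}(p(x,y)\,\|\,p(x)u(y))$. Both arguments of the divergence now live on $\mathcal{X}\times\mathcal{Y}$, which is exactly what the expression in your last line is missing.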
