[Math] Chain rule for conditional entropy

entropy, machine-learning, probability

I have to prove the chain rule for conditional entropy. I kept getting stuck on one step, so I looked up a proof and found this:
\begin{align}H(Y\mid X)&= \sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log \frac {p(x)} {p(x,y)}\tag{1}\\&= -\sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log\,p(x,y) + \sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log\,p(x) \tag{2}\\
&= H(X,Y) + \sum_{x \in \mathcal X} p(x)\log\,p(x)\tag{3}\\
&= H(X,Y) - H(X)\tag{4}\end{align}

This is identical to the proof I had worked out on my own, but it uses a step that I didn't think was allowed. Specifically, can someone explain how you are able to move from the joint probability to the marginal probability between steps (2) and (3)? It seems like I'm missing something basic and obvious, but I don't see it.

Best Answer

The question amounts to asking whether $$ \sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log\,p(x) = \sum_{x \in \mathcal X} p(x)\log\,p(x) \quad \text{?} $$

I disapprove of using the same symbol, $p$, for two different functions. If one instead writes $p_X(x)$ with capital $X$ and lower-case $x$ in the appropriate places, one can then understand such things as the difference between $p_X(4)$ and $p_Y(4)$, and the meaning of $p_X(4)$, and things like $\Pr(X\le x)$.
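For instance (a made-up illustration, not from the question itself): if $X$ is the value of a fair six-sided die and $Y$ is the number of heads in four fair coin flips, then $p_X(4)=\tfrac16$ while $p_Y(4)=\binom44 2^{-4}=\tfrac1{16}$, so a bare $p(4)$ would be ambiguous.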

We have \begin{align} & \sum_{\begin{smallmatrix} x\in\mathcal X \\ y\in\mathcal Y \end{smallmatrix}} \Pr(X=x\ \&\ Y=y) \log p(x) & & \text{where $p(x)$ is some function of $x$} \\[10pt] = {} & \sum_{x\in\mathcal X} \left( \sum_{y\in\mathcal Y} \Pr(X=x\ \&\ Y=y) \log p(x) \right) & & \text{where $p(x)$ is some function of $x$.} \end{align}

The factor $\log p(x)$ within the sum $\displaystyle\sum_{y\in\mathcal Y}$ does not depend on $y$, i.e. it does not change as $y$ runs through the list of all members of $\mathcal Y$. Therefore we can pull it out, getting this: $$ \sum_{x\in\mathcal X} \left( (\log p(x)) \sum_{y\in\mathcal Y} \Pr(X=x\ \&\ Y=y) \right) $$ Now all we need to do is show that $$ \sum_{y\in\mathcal Y} \Pr(X=x\ \&\ Y=y) = \Pr(X=x). $$ For example, if $\mathcal Y=\{y_1,y_2,y_3\}$, we would need to show that $$ \Pr(X=x\ \&\ Y=y_1) + \Pr(X=x\ \&\ Y=y_2) + \Pr(X=x\ \&\ Y=y_3) = \Pr(X=x). $$ Can you do that?
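If it helps to see what is going on numerically, here is a minimal sketch (my own addition, using an arbitrary made-up joint distribution, not anything from the question) that checks both the marginalization identity above and the resulting chain rule $H(Y\mid X)=H(X,Y)-H(X)$:

```python
import numpy as np

# Arbitrary made-up joint distribution p(x, y) over X in {0,1,2}, Y in {0,1}.
joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.15, 0.25]])
assert np.isclose(joint.sum(), 1.0)

# Marginalization step: summing the joint over y recovers p(x).
p_x = joint.sum(axis=1)          # Pr(X = x) = sum_y Pr(X = x & Y = y)
print("p_X:", p_x)

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_XY = entropy(joint.ravel())    # H(X, Y)
H_X = entropy(p_x)               # H(X)
# H(Y | X) directly from its definition:
# -sum_{x,y} p(x,y) log p(y|x), with p(y|x) = p(x,y) / p(x).
H_Y_given_X = -np.sum(joint * np.log2(joint / p_x[:, None]))

print("H(Y|X)        =", H_Y_given_X)
print("H(X,Y) - H(X) =", H_XY - H_X)
assert np.isclose(H_Y_given_X, H_XY - H_X)   # chain rule holds
```

The final `assert` is exactly the identity at issue: once $\sum_{y\in\mathcal Y}\Pr(X=x\ \&\ Y=y)=\Pr(X=x)$ holds, the chain rule follows.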