[Math] Relation between cross entropy and joint entropy

coding-theory, entropy, information-theory, optimization

Definition of cross entropy (Wiki link for details):
$$H(p,q) = H(p) + \mathcal{D}_{KL}(p||q)$$

Definition of joint entropy:
\begin{align*}
H(X,Y) &= -\sum_x \sum_y p(x,y) \log p(x,y)\\
&= H(X) + H(Y|X)
\end{align*}

What are the differences between these two? I'm having difficulty distinguishing the concept of entropy for random variables from the concept of entropy for distributions.

Are there any connections between the two? E.g., can we use joint entropy to prove the cross-entropy formula? If not, how would you derive the cross-entropy formula?

Best Answer

Unfortunately, practitioners use nearly identical notation for both cross entropy and joint entropy. This adds to the confusion. I will distinguish the two by using
$H_q(p)$ for cross entropy, and $H(x,y)$ for joint entropy.

Cross Entropy

Cross Entropy tells us the average length of a message from one distribution using the optimal coding length of another. For example,

$$H_q(p) = \sum_{x} p(x) \log\bigg(\frac{1}{q(x)}\bigg)$$

Here, $\log\big(\frac{1}{q(x)}\big)$ is the optimal coding length for messages coming from the $q$ distribution.

Meanwhile, $p(x)$ is the probability with which message $x$ is actually drawn from $p$, i.e., how often that coding cost is paid.

Putting these together we can interpret $H_q(p)$ as the average cost of sending messages from $p$ using the optimal coding length for $q$.
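Expanding the logarithm makes the relation quoted in the question explicit:

$$H_q(p) = \sum_x p(x)\log\frac{1}{q(x)} = \sum_x p(x)\log\frac{1}{p(x)} + \sum_x p(x)\log\frac{p(x)}{q(x)} = H(p) + \mathcal{D}_{KL}(p||q)$$

As a minimal numeric sketch (using NumPy and two made-up distributions, so the specific numbers are only illustrative), the following checks this identity and also shows that cross entropy is not symmetric:

```python
import numpy as np

# Hypothetical distributions over the same four outcomes (made-up numbers).
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

def entropy(p):
    """H(p) = sum_x p(x) log2(1/p(x)), in bits."""
    return np.sum(p * np.log2(1.0 / p))

def cross_entropy(p, q):
    """H_q(p): average cost of messages drawn from p, coded optimally for q."""
    return np.sum(p * np.log2(1.0 / q))

def kl_divergence(p, q):
    """D_KL(p||q): the extra cost paid for using q's code instead of p's."""
    return np.sum(p * np.log2(p / q))

print(cross_entropy(p, q))                  # 2.0 bits
print(entropy(p) + kl_divergence(p, q))     # 2.0 bits as well: H_q(p) = H(p) + D_KL(p||q)
print(cross_entropy(q, p))                  # 2.25 bits: H_p(q) != H_q(p), so cross entropy is not symmetric
```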

Joint Entropy

Joint Entropy tells us the average cost of sending multiple messages simultaneously. Or perhaps more intuitively, the average cost of sending a single message that has multiple parts. For example,

$$H(x,y) = \sum_{x,y} p(x,y) \log \bigg(\frac{1}{p(x,y)} \bigg)$$

Here, $\log \big(\frac{1}{p(x,y)} \big)$ is the optimal coding length for messages coming from the $p$ distribution.

Meanwhile, $p(x,y)$ is the probability with which the message $(x,y)$ is actually drawn from $p$.

It may be clear from this that joint entropy is merely the extension of entropy to multiple variables. In addition, multiple random variables are often represented as a vector $\bf x$, in which case calculating the entropy of its distribution $p({\bf x})$ results in what appears to be a 'regular' entropy:

$$H(p)=\sum_{{\bf x}}p({\bf x})\log \bigg(\frac{1}{p({\bf x})} \bigg)$$

I make this last point to demonstrate that distinguishing between entropy and joint entropy may not be very useful.
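As a similar sketch for joint entropy (again with a made-up joint table, here $2 \times 3$), the snippet below computes $H(x,y)$ directly from the table and also via the chain rule $H(X,Y) = H(X) + H(Y|X)$ quoted in the question; both give the same number:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over 2 x 3 outcomes (made-up numbers).
p_xy = np.array([[0.10, 0.20, 0.30],
                 [0.20, 0.10, 0.10]])

def H(probs):
    """Shannon entropy in bits of any table of probabilities summing to 1."""
    probs = probs[probs > 0]                 # ignore zero-probability entries
    return np.sum(probs * np.log2(1.0 / probs))

joint = H(p_xy)                              # treats the whole table as one distribution over (x, y)

p_x = p_xy.sum(axis=1)                       # marginal p(x)
# H(Y|X) = sum_x p(x) * H(Y | X = x), where each row divided by p(x) is p(y | x)
cond = sum(px * H(row / px) for px, row in zip(p_x, p_xy))

print(joint)                                 # H(X, Y)
print(H(p_x) + cond)                         # H(X) + H(Y|X): matches, per the chain rule
```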

Further Resources

Chris Olah wrote an excellent article on information theory with the goal of making things visually interpretable. It is called Visual Information Theory, and the notation I adopted comes from him. The distinction and relation between cross entropy and joint entropy are demonstrated via figures and analogies. The visualizations are very well done, such as the following, which demonstrates why cross entropy is not symmetric.

[Figure: code-length diagram illustrating the asymmetry of cross entropy]

Or this one which depicts the relationship between joint entropy, entropy, and conditional entropy.

[Figure: diagram relating joint entropy, marginal entropy, and conditional entropy]