Solved – Intuitively, why is cross entropy a measure of the distance between two probability distributions

cross entropy, distance, distributions, information theory, probability

For two discrete distributions $p$ and $q$, cross entropy is defined as

$$H(p,q)=-\sum_x p(x)\log q(x).$$
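For concreteness, here is a minimal NumPy sketch that evaluates this sum directly (the distributions below are just example values, not part of the question):

```python
import numpy as np

# Cross entropy H(p, q) = -sum_x p(x) log q(x) for two discrete
# distributions over the same support (example values chosen here).
def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # Terms with p(x) = 0 contribute nothing; mask them to avoid 0 * log(0).
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
print(cross_entropy(p, q))   # ≈ 1.090 nats
print(cross_entropy(p, p))   # entropy of p itself, ≈ 1.040 nats
```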

I wonder why this would be an intuitive measure of distance between two probability distributions?

I see that $H(p,p)$ is the entropy of $p$, which measures the "surprise" of $p$. $H(p,q)$ is the same quantity but with $q$ substituted for $p$ inside the logarithm. I still do not understand the intuitive meaning behind the definition.

Best Answer

Minimizing the cross entropy is often used as a learning objective in generative models, where $p$ is the true distribution and $q$ is the learned distribution.

The cross entropy of $p$ and $q$ is equal to the entropy of $p$ plus the KL divergence between $p$ and $q$:

$$H(p, q) = H(p) + D_{KL}(p \| q).$$

You can think of $H(p)$ as a constant, because $p$ comes directly from the training data and is not learned by the model, so only the KL divergence term matters for optimization. The motivation for KL divergence as a distance between probability distributions is that it measures the expected number of extra bits needed to encode samples from $p$ when you use a code optimized for the approximation $q$ instead of one optimized for $p$ itself.
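To make the decomposition concrete, here is a small NumPy check (the distributions are illustrative values chosen here) that $H(p,q)$ equals $H(p) + D_{KL}(p \| q)$:

```python
import numpy as np

# Illustrative check that H(p, q) = H(p) + D_KL(p || q)
# for two example discrete distributions (values chosen arbitrarily).
p = np.array([0.5, 0.25, 0.25])   # "true" distribution
q = np.array([0.4, 0.4, 0.2])     # approximating distribution

H_pq = -np.sum(p * np.log(q))      # cross entropy H(p, q)
H_p  = -np.sum(p * np.log(p))      # entropy H(p)
D_kl =  np.sum(p * np.log(p / q))  # KL divergence D_KL(p || q)

print(H_pq)         # ≈ 1.090 nats
print(H_p + D_kl)   # ≈ 1.090 nats, matching H_pq
```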

Note that KL divergence isn't a proper distance metric: for one thing, it is not symmetric in $p$ and $q$, and it does not satisfy the triangle inequality. If you need a true distance metric on probability distributions, you will have to use something else. But if you are using the word "distance" informally, then KL divergence works fine.
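A quick numeric illustration of the asymmetry (again with example distributions chosen here):

```python
import numpy as np

# D_KL(p || q) and D_KL(q || p) generally differ, so KL divergence
# is not symmetric and hence not a metric (example values chosen here).
def kl(a, b):
    return np.sum(a * np.log(a / b))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

print(kl(p, q))   # ≈ 0.368 nats
print(kl(q, p))   # ≈ 0.511 nats
```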