It seems to be closely related to the concept of Kullback–Leibler divergence (see Kullback and Leibler, 1951). In their article, Kullback and Leibler discuss the mean information for discriminating between two hypotheses (defined as $I_{1:2}(E)$ in eqs. 2.2–2.4) and cite pp. 18–19 of Shannon and Weaver's *The Mathematical Theory of Communication* (1949) and p. 76 of Wiener's *Cybernetics* (1948).
EDIT:
Additional aliases include the Kullback–Leibler information measure, the relative information measure, cross-entropy, I-divergence, and Kerridge inaccuracy.
To encode an event occurring with probability $p$, you need at least $\log_2(1/p)$ bits (why? see my answer on "What is the role of the logarithm in Shannon's entropy?").
So with an optimal encoding, the average length of an encoded message is
$$
\sum_i p_i \log_2(\tfrac{1}{p_i}),
$$
that is, the Shannon entropy of the original probability distribution.
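As a quick check of this formula, here is a minimal Python sketch (the function name `shannon_entropy` is my own, not from any library):

```python
import math

def shannon_entropy(p):
    # Average optimal code length in bits: sum_i p_i * log2(1 / p_i).
    # Zero-probability symbols contribute nothing (0 * log2(1/0) is taken as 0).
    return sum(p_i * math.log2(1.0 / p_i) for p_i in p if p_i > 0)

print(shannon_entropy([0.5, 0.5]))   # 1.0 bit per symbol
print(shannon_entropy([0.25] * 4))   # 2.0 bits per symbol
```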
However, if for a probability distribution $P$ you use an encoding that is optimal for a different probability distribution $Q$, then the average length of the encoded message is
$$
\sum_i p_i \,\text{code\_length}(i) = \sum_i p_i \log_2(\tfrac{1}{q_i}),
$$
which is the cross-entropy, and it is greater than $\sum_i p_i \log_2(\tfrac{1}{p_i})$ whenever $Q \neq P$.
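A small sketch of this mismatched-code case, again in plain Python with made-up example distributions:

```python
import math

def cross_entropy(p, q):
    # Average code length in bits when symbols drawn from p are encoded
    # with a code that is optimal for q: sum_i p_i * log2(1 / q_i).
    return sum(p_i * math.log2(1.0 / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.8, 0.2]   # the actual source distribution (hypothetical values)
q = [0.5, 0.5]   # the distribution the code was designed for

print(cross_entropy(p, p))   # ~0.72 bits: the entropy of p, the best achievable
print(cross_entropy(p, q))   # 1.0 bit: longer, because the code is tuned to q, not p
```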
As an example, consider an alphabet of four letters (A, B, C, D) in which A and B have the same frequency and C and D do not appear at all, so the probability distribution is $P=(\tfrac{1}{2}, \tfrac{1}{2}, 0, 0)$.
Then if we want to encode it optimally, we encode A as 0 and B as 1, and we get one bit of encoded message per letter. (This is exactly the Shannon entropy of our probability distribution.)
But if we keep the same probability distribution $P$ and encode it according to a distribution $Q=(\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})$ in which all letters are equally probable, then we get two bits per letter (for example, we encode A as 00, B as 01, C as 10 and D as 11).
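The same example in code, to make the numbers concrete (a sketch, not a general-purpose implementation):

```python
import math

# P puts all mass on A and B; Q is uniform over A, B, C, D.
P = [0.5, 0.5, 0.0, 0.0]
Q = [0.25, 0.25, 0.25, 0.25]

entropy_P = sum(p * math.log2(1.0 / p) for p in P if p > 0)
cross_PQ = sum(p * math.log2(1.0 / q) for p, q in zip(P, Q) if p > 0)

print(entropy_P)             # 1.0 bit per letter with the code tuned to P (A -> 0, B -> 1)
print(cross_PQ)              # 2.0 bits per letter with the code tuned to Q (A -> 00, B -> 01, ...)
print(cross_PQ - entropy_P)  # 1.0 bit of overhead: this difference is D_KL(P || Q)
```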
Best Answer
Minimizing the cross-entropy is often used as a learning objective in generative models, where $p$ is the true distribution and $q$ is the learned distribution.
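For instance, when the training target for one example is a single observed class, $p$ is a one-hot distribution and the cross-entropy reduces to $-\log$ of the probability the model assigns to that class. A minimal sketch with made-up numbers:

```python
import math

# Hypothetical single training example: p is the one-hot "true" distribution
# from the data, q is the model's current predicted distribution.
p = [0.0, 1.0, 0.0]   # the observed outcome is class 1
q = [0.2, 0.7, 0.1]   # model probabilities (made-up values)

loss = sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)
print(loss)           # ~0.51 bits, i.e. -log2 q[1]; the loss shrinks as q[1] -> 1
```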
The cross-entropy of $p$ and $q$ is equal to the entropy of $p$ plus the KL divergence between $p$ and $q$:
$$
H(p, q) = H(p) + D_{KL}(p \| q)
$$
You can think of $H(p)$ as a constant because $p$ comes directly from the training data and is not learned by the model, so only the KL divergence term matters for optimization. The motivation for KL divergence as a distance between probability distributions is that it tells you how many extra bits, on average, are needed to encode samples from $p$ when you use a code optimized for the approximation $q$ instead of one optimized for $p$.
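A quick numerical check of this decomposition, with arbitrary example distributions:

```python
import math

def H(p):            # entropy of p, in bits
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def H_cross(p, q):   # cross-entropy of p and q, in bits
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

def D_KL(p, q):      # KL divergence D_KL(p || q), in bits
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # stand-in for the data distribution
q = [0.5, 0.3, 0.2]  # stand-in for the model's approximation

print(H_cross(p, q))        # ~1.28 bits
print(H(p) + D_KL(p, q))    # same value: H(p, q) = H(p) + D_KL(p || q)
```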
Note that KL divergence isn't a proper distance metric. For one thing, it is not symmetric in $p$ and $q$. If you need a true distance metric for probability distributions, you will have to use something else. But if you are using the word "distance" informally, then you can use KL divergence.
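To see the asymmetry concretely, a tiny sketch with arbitrary distributions:

```python
import math

def D_KL(p, q):
    # KL divergence D_KL(p || q) in bits; assumes q_i > 0 wherever p_i > 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
print(D_KL(p, q))   # ~0.53 bits
print(D_KL(q, p))   # ~0.74 bits: a different value, so D_KL is not symmetric
```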