The difference between KL Divergence and merely subtracting entropy measurements

entropy, information theory, statistics

I'm wondering what the difference is between KL divergence and just subtracting one entropy measurement from another. I can see what the difference is mathematically…but I'm having a hard time grasping what it means. Throughout this I'm thinking of entropy measurements of English text, but I'm not sure that matters.

Consider that you have two models for some random variable X: p (the good model) and q (the less-good model).

If we compute the entropy under each model,
$$H_p(X) = -\sum_{x\in\mathcal{X}} p(x) \log(p(x)) = \text{let's say } 2\text{ bits per character}$$
and
$$H_q(X) = -\sum_{x\in\mathcal{X}} q(x) \log(q(x)) = 3\text{ bits per character},$$
we can obtain a difference measure simply by subtracting:

$$H_\text{diff}(p,q) = \left(-\sum_{x\in\mathcal{X}} q(x) \log(q(x))\right) - \left(-\sum_{x\in\mathcal{X}} p(x) \log(p(x))\right) = 1\text{ bit per character}$$

simplified a bit to be analogous in form to a common statement of KL Divergence…
$$H_\text{diff}(p,q) = \sum_{x\in\mathcal{X}} \big(p(x)\log(p(x)) - q(x)\log(q(x))\big) = 1\text{ bit per character}$$

Okay, so that makes a sort of intuitive sense. If we've got 1000 characters, then encoding them under model q costs about 3000 bits on average, whereas encoding them under model p costs about 2000 bits. Got it, I think?
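To make that concrete, here's a minimal sketch. The distributions are made up purely to reproduce the 2-bit / 3-bit numbers above: p is uniform over 4 characters of a hypothetical 8-character alphabet, q is uniform over all 8.

```python
import numpy as np

# Hypothetical 8-character alphabet chosen so the entropies match the
# numbers above: p is uniform over 4 characters (2 bits per character),
# q is uniform over all 8 characters (3 bits per character).
p = np.array([0.25, 0.25, 0.25, 0.25, 0.0, 0.0, 0.0, 0.0])
q = np.full(8, 1 / 8)

def entropy_bits(dist):
    """Shannon entropy in bits, ignoring zero-probability symbols."""
    nz = dist[dist > 0]
    return -np.sum(nz * np.log2(nz))

H_p = entropy_bits(p)   # 2.0 bits per character
H_q = entropy_bits(q)   # 3.0 bits per character
print(H_q - H_p)        # entropy difference: 1.0 bit per character
```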

But we can also use KL divergence, which is very similar:

$$D_\text{KL}(p \parallel q) = \sum_{x\in\mathcal{X}} p(x) \log\left(\frac{p(x)}{q(x)}\right)$$

Rearrange a bit…

$$D_\text{KL}(p \parallel q) = \sum_{x\in\mathcal{X}} \big(p(x)\log(p(x)) - p(x)\log(q(x))\big)$$

So, really, the only difference between simply taking the difference of the two entropy calculations and KL divergence is p(x) vs. q(x) in the second term on the right-hand side: that term is the cross-entropy $H(p,q) = -\sum_{x} p(x)\log(q(x))$ in the case of KL divergence, and the regular old entropy of q in my first example.
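In code, that comparison looks like this (reusing the hypothetical p and q from the sketch above): the KL divergence is cross-entropy minus entropy, while the naive measure replaces the cross-entropy with q's own entropy.

```python
import numpy as np

# Same hypothetical distributions as in the sketch above.
p = np.array([0.25, 0.25, 0.25, 0.25, 0.0, 0.0, 0.0, 0.0])
q = np.full(8, 1 / 8)

def entropy_bits(dist):
    nz = dist[dist > 0]
    return -np.sum(nz * np.log2(nz))

def cross_entropy_bits(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x), restricted to p's support."""
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

# KL divergence = cross-entropy minus entropy of p
kl = cross_entropy_bits(p, q) - entropy_bits(p)   # 1.0 bit
# Naive entropy difference = entropy of q minus entropy of p
h_diff = entropy_bits(q) - entropy_bits(p)        # also 1.0 bit here
print(kl, h_diff)
```

With these particular made-up distributions the two numbers happen to coincide, because q is uniform on its support, so the cross-entropy $-\sum_x p(x)\log(q(x))$ equals $H_q(X)$. The answer below gives a case where they come apart completely.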

I've been thinking about it for a while here and I can't come up with an answer. What, intuitively, is the difference between these two distance measures?

Best Answer

Besides the obvious difference pointed out in the comments by @Joe:

I think the key difference is that the KL divergence measures the relative difference between two probability measures (recall that absolute continuity, $p \ll q$, is required for the KL divergence to be finite), while the difference you mention is just a difference between the entropies of two entirely separate probability spaces, which gives no clue about how close the distributions are to each other. For instance, two very different distributions can have exactly the same entropy, so their entropy difference is zero even though they are far apart; the KL divergence, by contrast, is zero only when $p = q$.
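A minimal sketch of that last point, using two made-up distributions that are permutations of each other (so their entropies are identical, but they are clearly not the same distribution):

```python
import numpy as np

# Two made-up distributions over the same 3-symbol alphabet.
# q is just a permutation of p, so their entropies are identical.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])

entropy = lambda d: -np.sum(d * np.log2(d))
kl = np.sum(p * np.log2(p / q))

print(entropy(q) - entropy(p))  # 0.0       -- the entropy difference sees no gap at all
print(kl)                       # ~1.68 bits -- the KL divergence sees a large one
```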