Besides the obvious difference pointed out in the comments by @Joe:
I think the key difference is that the KL divergence measures the relative difference between two probability measures defined on the same space (recall that absolute continuity, $p \ll q$, is required for the KL divergence to be finite), while the quantity you mention is just a difference between the entropies of two possibly completely unrelated distributions, which gives no clue about how close the distributions are to each other.
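To illustrate that point concretely, here is a minimal Python sketch (the two distributions are arbitrary, chosen only so that one is a permutation of the other): their entropies are identical, so the difference of entropies is zero, yet the KL divergence is far from zero.

```python
import math

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; requires q > 0 wherever p > 0 (p << q)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# q is just a permutation of p, so both have exactly the same entropy.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]

print(entropy(p) - entropy(q))  # 0.0   -- says nothing about how close p and q are
print(kl_divergence(p, q))      # ~1.68 -- the distributions are actually far apart
```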
We have the following closely related notions:
- entropy (the information value)
- probability distribution (which outcomes should we already expect?)
- uncertainty (are we certain of the outcome, or will we learn something?)
Low entropy
When we receive a highly expected piece of information, we were already almost certain of its content, so we gain hardly any information value. Hence high probability, low uncertainty, low entropy.
Similarly, when we do NOT receive a very unexpected piece of information, we were almost certain not to receive it, so NOT receiving it carries little information value. Hence low probability (of the surprising event), low uncertainty, low entropy.
High entropy
When we receive a highly unpredictable, uniformly random, "flip of a coin"-like piece of information, we could not anticipate it and were quite uncertain of what it would be. The information value is very high! Hence roughly 50/50 probability, high uncertainty, high entropy.
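To put rough numbers on this intuition, here is a small Python sketch (the probabilities are purely illustrative); the "surprise" of a single outcome with probability $p$ is $-\log_2 p$:

```python
import math

def surprise(p):
    """Self-information of an outcome with probability p, in bits."""
    return -math.log2(p)

for p in (0.99, 0.5, 0.01):
    print(f"p = {p}: {surprise(p):.3f} bits")
# p = 0.99: 0.014 bits  (highly expected  -> we learn almost nothing)
# p = 0.5: 1.000 bits   (fair coin flip   -> exactly 1 bit)
# p = 0.01: 6.644 bits  (very unexpected  -> we learn a lot)
```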
Example
Suppose you were to guess an English word. Then consider the expected value of getting answers to the following questions:
- Does it contain the letter "E"?
- Does it contain the letter "Z"?
You should expect the answers to be "MAYBE" and "NO". Suppose, for the sake of the example, that a randomly chosen English word has a probability of around $p\approx1/8=12.5\%$ of containing the letter "E", whereas "Z" is quite rare (say $p\approx1/64$). Using those figures, we have:
$$
\begin{align}
I[E] &= -1/8\cdot\log_2(1/8) = 3/8 &&= 0.375\\
I[\text{not } E] &= -7/8\cdot\log_2(7/8) &&\approx 0.169\\
H[E,\ \text{not } E] &= I[E] + I[\text{not } E] &&\approx 0.544
\end{align}
$$
and
$$
\begin{align}
I[Z] &= -1/64\cdot\log_2(1/64) = 6/64 &&\approx 0.094\\
I[\text{not } Z] &= -63/64\cdot\log_2(63/64) &&\approx 0.022\\
H[Z,\ \text{not } Z] &= I[Z] + I[\text{not } Z] &&\approx 0.116
\end{align}
$$
Hence we will most likely learn something from an answer to the first question, which splits our candidate words into portions of roughly 1/8 and 7/8, whereas we will probably not learn much from confirming that "Z" is not contained in the word, since that only excludes a very small set of candidates (about 1 in 64).
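For what it's worth, these figures can be reproduced numerically; here is a short Python sketch using the same assumed probabilities of $1/8$ and $1/64$:

```python
import math

def binary_entropy(p):
    """Entropy in bits of a yes/no question whose answer is "yes" with probability p."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(1 / 8))   # ~0.544 bits -- "Does it contain the letter E?"
print(binary_entropy(1 / 64))  # ~0.116 bits -- "Does it contain the letter Z?"
```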
Ideal type of question
A yes/no or true/false question like this has the potential to bisect the candidate space into equal parts: if we could ask the right question and be sure to either include or exclude exactly half of the candidate words, we would gain 1 bit of information. The ideal type of question therefore has a coin-flip 50/50 probability.
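As a quick numerical check of that claim (a tiny Python sketch scanning a grid of probabilities), the 1-bit maximum is indeed attained by the 50/50 question:

```python
import math

def binary_entropy(p):
    # Same helper as in the sketch above.
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Scan p = 0.01, 0.02, ..., 0.99 and pick the question with the highest entropy.
best_p = max((p / 100 for p in range(1, 100)), key=binary_entropy)
print(best_p, binary_entropy(best_p))  # 0.5 1.0 -- the coin-flip question gives 1 bit
```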
Best Answer
Of course, things will depend on the precise perspective you take, but here's a pretty natural view from channel coding. Take a $q$-ary symmetric channel, i.e., $$ P(Y = x'|X = x) = \begin{cases} 1-\pi & x' = x\\ \pi/(q-1) & x' \neq x\end{cases}.$$
The capacity of this channel is $$ \max_{p_X} H(Y) - H(Y|X).$$ Here $H(Y|X)$ is a constant that does not depend on the input law $p_X$: $$ H(Y|X) = -(1-\pi) \log (1-\pi) + \pi\log(q-1) - \pi \log\pi.$$
Of course, placing the uniform law on $X$ further induces the uniform law on $Y$, so the capacity is $$ \log q - \pi \log(q-1) + \pi \log \pi + (1-\pi) \log(1-\pi).$$ Normalising by $\log q$, the rate needed to communicate a uniform $q$-ary random variable, and factoring it back out, we can write the capacity as $$ C = \log q \left( 1 - \bigl(\pi \log_q(q-1) - \pi \log_q(\pi) - (1-\pi) \log_q(1-\pi)\bigr)\right).$$
Compare this to the capacity of a BSC, $1 - h_2(\pi)$, and you see the similarity. The point, of course, is that if you're working with a $q$-ary alphabet, the $q$-ary symmetric channel is a basic model that you'll deal with often, so it's useful to have a generalisation of $h_2(\pi)$ that is pertinent here.
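If it helps, here is a small numerical sanity check in Python (the values $q = 4$ and $\pi = 0.1$ are arbitrary choices for illustration): it confirms that $\log q - H(Y|X)$ matches the $\log q\,(1 - \cdot)$ form above, and that the $q = 2$ case collapses to the familiar BSC capacity $1 - h_2(\pi)$.

```python
import math

def H_cond(q, pi):
    """H(Y|X) in bits for a q-ary symmetric channel with total error probability pi."""
    return -(1 - pi) * math.log2(1 - pi) + pi * math.log2(q - 1) - pi * math.log2(pi)

def capacity(q, pi):
    """Capacity in bits, achieved by the uniform input law: log2(q) - H(Y|X)."""
    return math.log2(q) - H_cond(q, pi)

def h_q(q, pi):
    """The q-ary entropy function (all logs taken to base q)."""
    return (pi * math.log(q - 1, q)
            - pi * math.log(pi, q)
            - (1 - pi) * math.log(1 - pi, q))

q, pi = 4, 0.1
print(capacity(q, pi))                    # ~1.37 bits
print(math.log2(q) * (1 - h_q(q, pi)))    # same value, via the normalised form
print(capacity(2, 0.1), 1 - h_q(2, 0.1))  # BSC: both give ~0.531 bits
```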