Entropy – Why KL-Divergence Uses ln in Its Formula

entropy, kullback-leibler

I notice that in the KL-divergence formula a $\ln$ function is used:

$$D_{KL}(P\|Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)},$$
where $i$ indexes the points, $P(i)$ is the true discrete probability distribution, and $Q(i)$ is the approximate distribution. Can anyone explain why the $\ln$ function is used here?

Why is it not simply
$$D_{KL}(P\|Q) = \sum_i P(i) \frac{P(i)}{Q(i)}?$$
Is there any special purpose?
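
For concreteness, here is a quick numerical check of the two expressions (a minimal sketch in Python; the distributions are made up for illustration). The log version is zero exactly when $P = Q$ and positive otherwise, whereas the log-free version sums to $\sum_i P(i) = 1$ when $P = Q$:

```python
import math

def kl_divergence(p, q):
    """KL divergence with the natural log, as in the first formula."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def log_free_version(p, q):
    """The proposed log-free expression: sum_i P(i) * P(i)/Q(i)."""
    return sum(pi * (pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]        # "true" distribution (made up for illustration)
q_same = [0.5, 0.3, 0.2]   # identical approximation
q_other = [0.4, 0.4, 0.2]  # a different approximation

print(kl_divergence(p, q_same))     # 0.0  -> identical distributions give zero
print(kl_divergence(p, q_other))    # > 0  -> positive when the distributions differ
print(log_free_version(p, q_same))  # ~1.0 -> not zero even when P == Q
```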

Best Answer

This answer is somewhat intuitive; I hope it gives you some ideas.

The KL divergence has several mathematical meanings. Although it is used to compare distributions, it comes from the field of information theory, where it measures how much "information" is lost when coding a source using a distribution other than the real one. In information theory, it can also be defined as the difference between two entropies: the cross-entropy of $P$ and $Q$ minus the entropy of $P$.
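
As a small sketch (again with made-up distributions), the identity $D_{KL}(P\|Q) = H(P, Q) - H(P)$, where $H(P, Q)$ is the cross-entropy and $H(P)$ the entropy, can be checked numerically:

```python
import math

def entropy(p):
    """Entropy H(P) = -sum_i P(i) log P(i), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) = -sum_i P(i) log Q(i), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# Both lines print the same value (about 0.025 for these distributions),
# up to floating-point error.
print(kl_divergence(p, q))
print(cross_entropy(p, q) - entropy(p))
```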

So to discuss KL divergence, we need to understand the meaning of entropy. The entropy is a measure of the "information" in a source, and it generally describes how "surprised" you will be by the outcome of the random variable. For instance, if you have a uniform distribution, you will always be "surprised" because there is a wide range of values it can take, so it has high entropy. However, if the random variable is a coin flip with $p=0.9$, then you will probably not be surprised, because it will succeed 90% of the time, so it has low entropy.
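
A sketch to make that comparison concrete (the specific distributions are just examples): the entropy of a uniform die, a fair coin, and a coin with $p = 0.9$, in nats:

```python
import math

def entropy(p):
    """Entropy in nats: H(X) = -sum_x P(x) log P(x)."""
    return -sum(px * math.log(px) for px in p if px > 0)

uniform_die = [1/6] * 6   # uniform: every outcome equally surprising
fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]  # almost always succeeds, rarely surprising

print(entropy(uniform_die))  # ~1.79 nats (high: outcome hard to predict)
print(entropy(fair_coin))    # ~0.69 nats
print(entropy(biased_coin))  # ~0.33 nats (low: outcome nearly certain)
```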

Entropy is defined as $H(X)=-\sum_x P(x)\log P(x)=E[-\log P(X)]$, which is the expectation of $-\log P(X)$, the information (surprisal) of each outcome. Why the log? One reason is the logarithm property $\log(xy)=\log(x)+\log(y)$: the information of a source composed of independent sources (so that $p(x,y)=p_1(x)p_2(y)$) is the sum of their individual information. This additivity can only happen by using a logarithm.
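
Here is a sketch of that additivity (with made-up independent distributions): the entropy of a pair of independent variables equals the sum of their individual entropies, precisely because the log turns the product $p_1(x)p_2(y)$ into a sum:

```python
import math

def entropy(p):
    """Entropy in nats: H(X) = -sum_x P(x) log P(x)."""
    return -sum(px * math.log(px) for px in p if px > 0)

# Two independent sources with made-up distributions.
p1 = [0.7, 0.3]
p2 = [0.5, 0.3, 0.2]

# Joint distribution of the independent pair: p(x, y) = p1(x) * p2(y).
joint = [a * b for a in p1 for b in p2]

# Additivity: H(X, Y) = H(X) + H(Y) when X and Y are independent.
print(entropy(joint))             # ~1.64 nats
print(entropy(p1) + entropy(p2))  # same value
```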
