[Math] Why is the Kullback-Leibler divergence not symmetric?

statistics

As is well known, the Kullback-Leibler divergence

$$\operatorname{KL}(P \,\|\, Q)=\sum_{i=1}^n P(i)\,\ln\!\left(\frac{P(i)}{Q(i)}\right)$$

is not symmetric in $P$ and $Q$.
I would like to know how this can be seen from the formula. I am aware that I could just try it out by exchanging $P$ and $Q$ in some special case, but I would like to know the mathematical reason behind it. Also, is $i$ here actually the random variable?
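For concreteness, here is the kind of special-case check I have in mind (a minimal sketch in Python; the distributions $P$ and $Q$ are made up for illustration):

```python
import math

def kl(p, q):
    """KL(P || Q) = sum_i P(i) * ln(P(i) / Q(i))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.9, 0.1]  # made-up distributions over two outcomes
Q = [0.5, 0.5]

print(kl(P, Q))  # ~0.368
print(kl(Q, P))  # ~0.511 -- a different value, so KL is not symmetric
```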

Thanks a lot
Miau

Best Answer

In addition to the algebraic reason that Robert Israel gave, there's a very nice "moral reason" that the Kullback-Leibler divergence is not symmetric. Roughly speaking, it's because you should think of the two arguments of the KL divergence as different kinds of things: the first argument is empirical data, and the second argument is a model you're comparing the data to. Here's how it works.

Take a bunch of independent random variables $X_1, \ldots, X_n$ whose possible values lie in a finite set.* Say these variables are identically distributed, with $\operatorname{Pr}(X_i = x) = p_x$. Let $F_{n,x}$ be the fraction of the variables whose values are equal to $x$. The list $F_n = (F_{n,x})_x$ is a random variable, often called the "empirical frequency distribution" of the $X_i$. What does $F_n$ look like when $n$ is very large?
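As a quick illustration (a minimal simulation sketch in Python; the distribution $p$ and the outcome labels are made up), here is $F_n$ computed from samples; for large $n$ it concentrates near $p$:

```python
import collections
import random

p = {"a": 0.4, "b": 0.4, "c": 0.2}  # made-up distribution p_x

for n in [100, 10000, 1000000]:
    # Draw n independent samples X_1, ..., X_n with Pr(X_i = x) = p_x.
    draws = random.choices(list(p), weights=list(p.values()), k=n)
    counts = collections.Counter(draws)
    # Empirical frequency distribution: the fraction of samples equal to x.
    F_n = {x: counts[x] / n for x in p}
    print(n, F_n)
```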

More specifically, let's try to estimate the probabilities of the possible values of $F_n$. Since the set of possible values is different for different $n$, take a sequence of frequency distributions $f_1, f_2, f_3, \ldots$ approaching a fixed frequency distribution $f$, with each $f_n$ a possible value of $F_n$. It turns out** that $$\lim_{n \to \infty} \tfrac{1}{n} \ln \operatorname{Pr}(F_n = f_n) = -\operatorname{KL}(f \,\|\, p).$$ In other words, the Kullback-Leibler divergence of $f$ from $p$ lets you estimate the probability of getting an empirical frequency distribution close to $f$ from a large number of independent random variables with distribution $p$.
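Here is a numerical check of that limit (a minimal sketch in Python; $f$, $p$, and the choice of $f_n$ as rounded counts over $n$ are made up for illustration). Since $n F_n$ has a multinomial distribution, $\operatorname{Pr}(F_n = f_n)$ can be computed exactly:

```python
import math

def log_multinomial_pmf(counts, p):
    """Exact log-probability that n iid draws from p produce these counts."""
    n = sum(counts)
    out = math.lgamma(n + 1)  # ln n!
    for c, q in zip(counts, p):
        out += c * math.log(q) - math.lgamma(c + 1)
    return out

def kl(f, p):
    return sum(fx * math.log(fx / px) for fx, px in zip(f, p) if fx > 0)

f = [0.5, 0.3, 0.2]  # made-up target frequency distribution
p = [0.4, 0.4, 0.2]  # made-up true distribution of the X_i

for n in [10, 100, 10000, 1000000]:
    counts = [round(n * fx) for fx in f]  # f_n = counts / n, approaching f
    rate = log_multinomial_pmf(counts, p) / n
    print(n, rate, -kl(f, p))  # rate approaches -KL(f || p), at speed O(log(n)/n)
```

Note how the first argument of $\operatorname{KL}$ is the empirical side ($f$) and the second is the model ($p$); swapping them answers a different question, which is the "moral reason" for the asymmetry.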

You can find everything I just said, and more, in the excellent article "Information Theory, Relative Entropy and Statistics," by François Bavaud.


* You can also do this more generally, but I don't know anything about that.

** Using Stirling's approximation, $\ln k! = k\ln k - k + O(\ln k)$.
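A quick numerical sanity check of that approximation (a minimal sketch in Python, using the standard-library identity math.lgamma(k + 1) $= \ln k!$):

```python
import math

for k in [10, 100, 1000, 10000]:
    exact = math.lgamma(k + 1)    # ln k!
    approx = k * math.log(k) - k  # leading terms of Stirling's approximation
    # The gap grows like (1/2) ln(2 pi k), i.e. it is O(ln k).
    print(k, exact - approx, math.log(k))
```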