Kullback-Leibler Divergence – Comprehensive Analysis

information theory · interpretation · kullback-leibler

Let us consider the following two probability distributions:

P       Q
0.01    0.002
0.02    0.004
0.03    0.006
0.04    0.008
0.05    0.01
0.06    0.012
0.07    0.014
0.08    0.016
0.64    0.928

I have calculated the Kullback-Leibler divergence, which equals $0.492820258$, and I want to know what this number shows me in general. The Kullback-Leibler divergence tells me how far one probability distribution is from another, right? It is similar to entropy terminology, but what does it mean in terms of numbers? If I get a result of 0.49, can I say that one distribution is roughly 50% away from the other?

Best Answer

The Kullback-Leibler divergence is not a proper metric, since it is not symmetric and does not satisfy the triangle inequality. So the "roles" played by the two distributions are different, and it is important to assign these roles according to the real-world phenomenon under study.

When we write (the OP has calculated the expression using base-2 logarithms)

$$\mathbb K\left(P||Q\right) = \sum_{i}\log_2 (p_i/q_i)p_i $$

we consider the $P$ distribution to be the "target distribution" (usually considered to be the true distribution), which we approximate by using the $Q$ distribution.
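
As a quick numerical check, here is a minimal Python sketch (not part of the original answer) that recomputes this sum, in bits, for the two distributions of the question:

```python
import math

# Distributions from the question: P is the "target", Q the approximation.
P = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.64]
Q = [0.002, 0.004, 0.006, 0.008, 0.01, 0.012, 0.014, 0.016, 0.928]

# K(P||Q) = sum_i p_i * log2(p_i / q_i), in bits because of the base-2 logarithm.
kl = sum(p * math.log2(p / q) for p, q in zip(P, Q))
print(round(kl, 5))  # ~0.49282, matching the value computed by the OP
```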

Now,

$$\sum_{i}\log_2 (p_i/q_i)p_i = \sum_{i}\log_2 (p_i)p_i-\sum_{i}\log_2 (q_i)p_i = -H(P) - E_P(\log_2 (Q))$$

where $H(P)$ is the Shannon entropy of distribution $P$ and $-E_P(\log_2 (Q)) \equiv H(P,Q)$ is called the "cross-entropy of $P$ and $Q$", which is also non-symmetric.

Writing

$$\mathbb K\left(P||Q\right) = H(P,Q) - H(P)$$

(here too, the order in which we write the distributions matters, since the cross-entropy is also not symmetric) permits us to see that the KL-divergence reflects an increase in entropy over the unavoidable entropy of distribution $P$.
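
To see this identity in numbers for the question's distributions, here is a minimal Python sketch (the variable names are mine, not from the answer):

```python
import math

P = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.64]
Q = [0.002, 0.004, 0.006, 0.008, 0.01, 0.012, 0.014, 0.016, 0.928]

H_P  = -sum(p * math.log2(p) for p in P)             # Shannon entropy H(P)
H_PQ = -sum(p * math.log2(q) for p, q in zip(P, Q))  # cross-entropy H(P, Q)
KL   = sum(p * math.log2(p / q) for p, q in zip(P, Q))

print(round(H_P, 4), round(H_PQ, 4), round(H_PQ - H_P, 5), round(KL, 5))
# 1.9486 2.4414 0.49282 0.49282 -- the last two agree, as the identity requires
```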

So, no, the KL-divergence is better interpreted not as a "distance measure" between distributions, but rather as a measure of the entropy increase caused by using an approximation to the true distribution rather than the true distribution itself.

So we are in Information Theory land. To hear it from the masters (Cover & Thomas):

"...if we knew the true distribution $P$ of the random variable, we could construct a code with average description length $H(P)$. If, instead, we used the code for a distribution $Q$, we would need $H(P) + \mathbb K (P||Q)$ bits on the average to describe the random variable."

The same wise people say:

"...it is not a true distance between distributions since it is not symmetric and does not satisfy the triangle inequality. Nonetheless, it is often useful to think of relative entropy as a “distance” between distributions."

But this latter approach is useful mainly when one attempts to minimize KL-divergence in order to optimize some estimation procedure. For the interpretation of its numerical value per se, it is not useful, and one should prefer the "entropy increase" approach.

For the specific distributions of the question (always using base-2 logarithms)

$$ \mathbb K\left(P||Q\right) = 0.49282,\;\;\;\; H(P) = 1.9486$$
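
Spelled out, the extra description length and its size relative to $H(P)$ are

$$H(P) + \mathbb K\left(P||Q\right) \approx 1.9486 + 0.4928 \approx 2.44 \text{ bits}, \qquad \frac{\mathbb K\left(P||Q\right)}{H(P)} = \frac{0.49282}{1.9486} \approx 0.25.$$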

In other words, you need about 25% more bits to describe the situation if you are going to use $Q$ while the true distribution is $P$. This means longer codewords, more time to write them, more memory, more time to read them, a higher probability of mistakes, etc. It is no accident that Cover & Thomas say that the KL-divergence (or "relative entropy") "measures the inefficiency caused by the approximation."