The KL divergence is a difference of integrals of the form
$$\begin{aligned}
I(a,b,c,d)&=\int_0^{\infty} \log\left(\frac{e^{-x/a}x^{b-1}}{a^b\Gamma(b)}\right) \frac{e^{-x/c}x^{d-1}}{c^d \Gamma(d)}\, \mathrm dx \\
&=-\frac{1}{a}\int_0^\infty \frac{x^d e^{-x/c}}{c^d\Gamma(d)}\, \mathrm dx
- \log(a^b\Gamma(b))\int_0^\infty \frac{e^{-x/c}x^{d-1}}{c^d\Gamma(d)}\, \mathrm dx\\
&\quad+ (b-1)\int_0^\infty \log(x) \frac{e^{-x/c}x^{d-1}}{c^d\Gamma(d)}\, \mathrm dx\\
&=-\frac{cd}{a}
- \log(a^b\Gamma(b))
+ (b-1)\int_0^\infty \log(x) \frac{e^{-x/c}x^{d-1}}{c^d\Gamma(d)}\,\mathrm dx
\end{aligned}$$
We just have to deal with the remaining integral on the right, which can be evaluated by observing
$$\begin{aligned}
\frac{\partial}{\partial d}\Gamma(d) &= \frac{\partial}{\partial d}\int_0^{\infty}e^{-x/c}\frac{x^{d-1}}{c^d}\, \mathrm dx\\
&= \frac{\partial}{\partial d} \int_0^\infty e^{-x/c} \frac{(x/c)^{d-1}}{c}\, \mathrm dx\\
&=\int_0^\infty e^{-x/c}\frac{x^{d-1}}{c^d} \log\frac{x}{c} \, \mathrm dx\\
&=\int_0^{\infty}\log(x)e^{-x/c}\frac{x^{d-1}}{c^d}\, \mathrm dx - \log(c)\Gamma(d).
\end{aligned}$$
Whence
$$\frac{b-1}{\Gamma(d)}\int_0^{\infty} \log(x)\,e^{-x/c}\frac{x^{d-1}}{c^d}\, \mathrm dx = (b-1)\frac{\Gamma'(d)}{\Gamma(d)} + (b-1)\log(c).$$
Plugging into the preceding yields
$$I(a,b,c,d)=\frac{-cd}{a} -\log(a^b\Gamma(b))+(b-1)\frac{\Gamma'(d)}{\Gamma(d)} + (b-1)\log(c).$$
The KL divergence between $\Gamma(c,d)$ and $\Gamma(a,b)$ equals $I(c,d,c,d) - I(a,b,c,d)$, which is straightforward to assemble.
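Carrying out that assembly (just subtracting the two evaluations of $I$ and collecting terms, nothing further is assumed) gives, for reference,
$$D_{\mathrm{KL}}\bigl(\Gamma(c,d)\,\|\,\Gamma(a,b)\bigr)=I(c,d,c,d)-I(a,b,c,d)=(d-b)\,\psi(d)-\log\Gamma(d)+\log\Gamma(b)+b\log\frac{a}{c}+\frac{d\,(c-a)}{a}.$$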
Implementation Details
Gamma functions grow rapidly, so to avoid overflow don't compute $\Gamma$ and then take its logarithm: instead, use the log-Gamma function found in any statistical computing platform (including Excel, for that matter).
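To see why this matters, here is a quick R illustration (a minimal sketch; the shape value 200 is arbitrary and chosen only to trigger overflow):
# gamma() overflows double precision long before lgamma() has any trouble:
log(gamma(200))   # Inf: Gamma(200) overflows double precision
lgamma(200)       # about 857.93, computed without ever forming Gamma(200)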
The ratio $\Gamma^\prime(d)/\Gamma(d)$ is the logarithmic derivative of $\Gamma,$ generally called $\psi,$ the digamma function. If it's not available to you, there are relatively simple ways to approximate it, as described in the Wikipedia article.
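As a rough sketch of one such approximation (a fallback only, since R already provides psigamma/digamma; it assumes a single positive argument): shift the argument upward with the recurrence $\psi(x) = \psi(x+1) - 1/x$, then apply the usual asymptotic series.
#
# Fallback approximation to the digamma function for platforms
# lacking a built-in one. Assumes a single positive argument.
#
digamma.approx <- function(x) {
  s <- 0
  while (x < 6) {      # recurrence: psi(x) = psi(x+1) - 1/x
    s <- s - 1/x
    x <- x + 1
  }
  # asymptotic series, accurate once x is moderately large
  s + log(x) - 1/(2*x) - 1/(12*x^2) + 1/(120*x^4) - 1/(252*x^6)
}
digamma.approx(195) - psigamma(195)   # difference should be negligibly small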
Here, to illustrate, is a direct R implementation of the formula in terms of $I$. This does not exploit an opportunity to simplify the result algebraically, which would make it a little more efficient (by eliminating a redundant calculation of $\psi$).
#
# `b` and `d` are Gamma shape parameters and
# `a` and `c` are scale parameters.
# (All, therefore, must be positive.)
#
KL.gamma <- function(a, b, c, d) {
  # i(a,b,c,d) evaluates the integral I(a,b,c,d) derived above.
  i <- function(a, b, c, d)
    -c * d / a - b * log(a) - lgamma(b) + (b - 1) * (psigamma(d) + log(c))
  # KL divergence between Gamma(scale=c, shape=d) and Gamma(scale=a, shape=b)
  i(c, d, c, d) - i(a, b, c, d)
}
print(KL.gamma(1/114186.3, 202, 1/119237.3, 195), digits=12)
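As an optional sanity check (not part of the solution above), one can compare the closed form with direct numerical integration of the KL integrand; the parameter values below are arbitrary, chosen small enough that integrate() behaves well.
#
# Numerical check of KL.gamma against direct integration.
# Arbitrary test parameters: a, c are scales; b, d are shapes.
#
a <- 2; b <- 3; c <- 1.5; d <- 4
f <- function(x) dgamma(x, shape=d, scale=c) *
  (dgamma(x, shape=d, scale=c, log=TRUE) - dgamma(x, shape=b, scale=a, log=TRUE))
integrate(f, 0, Inf)$value   # numerical estimate of the divergence
KL.gamma(a, b, c, d)         # closed form; the two should agree closely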
As far as I can understand, you are solving the following problem:
there are two analytical distributions, $p(x)$ and $q(x)$, and you want to calculate the distance between them, $D(p, q)$.
There are plenty of measures of distance between two distributions:
I suggest you try a few from the list above, as all of them are rather easy to implement. In most applications, numerical experiments are what give you the key to success. Then you can select the one that suits you best (since I haven't found any requirements for this distance in your question, I can't suggest anything more specific).
As $p(x)$ and $q(x)$ are complex, it is almost impossible that an analytical expression exists for any such $D(p, q)$, so you will need a numerical way to calculate those distances. Note that calculating almost all of these distances involves numerical integration, so they will be rather imprecise if the dimension of $x$ is high.
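For instance, a one-dimensional distance of this kind takes only a few lines of numerical integration (a sketch only; the Hellinger distance is used purely as an example, and the two normal densities below are placeholders for your actual $p$ and $q$):
#
# Sketch: numerically computing one such distance (Hellinger, as an
# example) between two one-dimensional densities via integrate().
#
p <- function(x) dnorm(x, mean = 0, sd = 1)
q <- function(x) dnorm(x, mean = 1, sd = 2)
h2 <- 1 - integrate(function(x) sqrt(p(x) * q(x)), -Inf, Inf)$value
sqrt(h2)   # Hellinger distance, always between 0 and 1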
Best Answer
I'd like to add an answer, which may be unsatisfying, to this question through the lens of deep learning, mostly in NLP:
First things first, let's look at the definition (in terms of your question):
$$ KL(q||p)=\sum_s q(s)\log \frac{q(s)}{p(s)} $$ When $p(s) > 0$ and $q(s)\to 0$, the contribution of $s$ to the KL divergence shrinks to 0, which means MLE assigns an extremely low cost to scenarios where the model generates samples that do not lie on the data distribution.
Consider the case where the corpus at hand contains essentially all samples existing in the world; then $q(s) \to 0$ indicates that $s$ occurs very rarely in the corpus (by the law of large numbers), yet the model may still assign it a high probability (for example because of samples that look alike but are in fact different or even opposite in meaning). In this case, because such samples are barely trained on and yet receive high probability under the model distribution, rare samples that do not lie on the data distribution may be generated at test or validation time.
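A quick numerical illustration of that vanishing penalty (a sketch; the numbers are arbitrary):
# The per-sample term q * log(q / p) vanishes as q -> 0,
# even when the model probability p stays large.
q <- c(1e-2, 1e-4, 1e-6); p <- 0.5
q * log(q / p)   # roughly -3.9e-02, -8.5e-04, -1.3e-05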
For your sub-questions:
You can refer to this answer, which states that "Cross-entropy is preferred for classification, while mean squared error is one of the best choices for regression". Note that training by cross-entropy is the same as training using relative entropy. For the details please refer to this.
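The reason the two trainings coincide is the standard decomposition (here $q$ is the data distribution, fixed during training, and $p$ is the model):
$$H(q,p)=-\sum_s q(s)\log p(s)=\underbrace{-\sum_s q(s)\log q(s)}_{H(q),\ \text{constant in }p}+\sum_s q(s)\log\frac{q(s)}{p(s)}=H(q)+KL(q||p),$$
so minimizing the cross-entropy over the model parameters minimizes exactly $KL(q||p)$.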
If I understand your question correctly, I suppose it can be refuted by the counterexample of the loss function used for SVMs. Please refer to this question and this answer. Kullback-Leibler divergence cannot solve all problems in estimation.