Solved – Disadvantages of the Kullback-Leibler divergence

estimation, kullback-leibler

I'm working on a calibration problem that uses the Kullback-Leibler divergence as an error measure between an empirical distribution $p$ and a theoretical distribution $q$. In the model, the $q$ distribution is normal with fixed parameters. I have two questions:

  1. Is the Kullback-Leibler divergence the best f-divergence to use as the error measure?
  2. Does using the Kullback-Leibler divergence entail any issues?
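
For concreteness, here is a minimal sketch of the kind of error being computed; the data, bin edges, and the normal's parameters are hypothetical placeholders, not part of the original problem:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: empirical distribution p from binned samples,
# theoretical normal q with fixed parameters, discretized on the same bins.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.2, scale=1.1, size=5000)

bins = np.linspace(-4, 4, 41)
p, _ = np.histogram(data, bins=bins)
p = p / p.sum()                      # empirical probabilities per bin

cdf = norm(loc=0.0, scale=1.0).cdf   # fixed theoretical parameters
q = np.diff(cdf(bins))
q = q / q.sum()                      # discretized normal probabilities

mask = p > 0                         # bins with p = 0 contribute 0
kl_error = np.sum(p[mask] * np.log(p[mask] / q[mask]))
print(kl_error)
```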

Best Answer

I'd like to add a first answer to this question, which may well be unsatisfying, through the lens of deep learning, mostly in NLP:

First things first,

Disadvantages of the Kullback-Leibler divergence

Let's look at the definition (note that in what follows $q$ plays the role of the data distribution and $p$ the role of the model):
$$ KL(q\|p)=\sum_s q(s)\log \frac{q(s)}{p(s)} $$ When $p(s) > 0$ and $q(s)\to 0$, the term $q(s)\log\frac{q(s)}{p(s)}$ shrinks to 0, so such samples contribute almost nothing to the divergence. This means MLE assigns an extremely low cost to the scenario where the model generates samples that do not lie on the data distribution.

Consider this: suppose the corpus in hand covers essentially all samples that exist in the world. Then $q(s) \to 0$ indicates that $s$ occurs very rarely in the corpus (by the law of large numbers), and yet the model may still assign it a large probability $p(s)$, for example because it looks like common samples while actually being different or even opposite in meaning. In this case, because such samples are barely penalized during training and therefore keep their high probabilities under the model, rare samples that do not lie on the data distribution may be generated at test or validation time.
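
To see this numerically, here is a toy sketch (the distributions are made up for illustration): the model puts 30% of its mass on a category the data almost never produces, yet the corresponding KL term barely registers it.

```python
import numpy as np

def kl_terms(q, p):
    """Per-element contributions q(s) * log(q(s) / p(s)); zero where q(s) = 0."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    out = np.zeros_like(q)
    mask = q > 0
    out[mask] = q[mask] * np.log(q[mask] / p[mask])
    return out

# Data distribution q: the last category barely occurs in the corpus.
q = np.array([0.60, 0.399, 0.001])

# Model p: puts a large 30% of its mass on that rare, off-data category.
p = np.array([0.40, 0.30, 0.30])

terms = kl_terms(q, p)
print(terms)        # contribution of the rare category is ~ -0.006
print(terms.sum())  # total KL(q || p)
```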

For your sub-questions:

Is the Kullback-Leibler divergence the best f-divergence to use as the error measure?

You can refer to this answer, which states that "Cross-entropy is preferred for classification, while mean squared error is one of the best choices for regression." Note that training with cross-entropy is equivalent to training with relative entropy (the Kullback-Leibler divergence). For the details, please refer to this.
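
To spell out that equivalence (writing $q$ for the data distribution and $p$ for the model, as above), the cross-entropy decomposes as
$$ H(q,p) = -\sum_s q(s)\log p(s) = \underbrace{-\sum_s q(s)\log q(s)}_{H(q)} + KL(q\|p), $$
and since $H(q)$ does not depend on the model, minimizing the cross-entropy in $p$ is the same as minimizing $KL(q\|p)$.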

Does using the Kullback-Leibler divergence entail any issues?

If I understand your question correctly, the idea that the Kullback-Leibler divergence is always the right error measure can be refuted by the loss function used for SVMs, which is not based on any divergence between distributions; please refer to this question and this answer. The Kullback-Leibler divergence cannot solve every problem in estimation.
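
For reference, a minimal sketch of the hinge loss that SVMs optimize (labels in $\{-1, +1\}$, scores from some classifier; the numbers below are made up): it is a margin-based objective rather than a divergence between distributions.

```python
import numpy as np

def hinge_loss(scores, labels):
    """Mean hinge loss max(0, 1 - y * f(x)) for labels y in {-1, +1}."""
    return float(np.mean(np.maximum(0.0, 1.0 - labels * scores)))

# Toy example: two points classified with margin > 1, one violating the margin.
scores = np.array([2.0, 0.3, -1.5])
labels = np.array([1, 1, -1])
print(hinge_loss(scores, labels))  # only the middle point contributes
```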
