Neural Networks – Cross Entropy vs KL Divergence: What’s Minimized Directly in Practice?

cross-entropy, kullback-leibler, maximum-likelihood, neural-networks, risk

My understanding is that in ML one can establish a connection between these quantities using the following line of reasoning:

  1. Assuming we plan to use ML to make decisions, we choose to minimize our Risk against a well-defined loss function that scores those decisions. Since we often don't know the true distribution of the data, we can't directly minimize this Risk (our expected loss), and instead choose to minimize our Empirical Risk, i.e. ER (or the structural risk, if using regularization). It's empirical because we compute this risk as an average of the loss function on observed data.

  2. If we assume that our model can output probabilities for those decisions, and we are solving a problem that involves hard decisions for which we have some ground-truth examples, we can model the optimization of those decisions as minimizing the ER with a cross-entropy loss function, and thus treat decision making as a classification problem. Under this loss, the ER is actually the same as (not just equivalent to) the negative log likelihood (NLL) of the model for the observed data. So one can interpret minimizing the ER as finding an MLE solution for our probabilistic model given the data.

  3. From the above, we can also establish that minimizing the CE is equivalent to minimizing the KL divergence between our model (say Q) for generating decisions and the true model (P) that generates the actual data and decisions. This is apparently a nice result, because one can argue that while we don't know the true data-generating (optimal decision-making) distribution, we are doing "our best" to estimate it, in a KL sense. However, the CE is not the same as the KL divergence: they measure different things and of course take on different values. (A small numeric sketch follows this list.)
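
To make items 2 and 3 concrete, here is a minimal numeric sketch (the toy labels and model probabilities below are made up for illustration): the cross-entropy empirical risk is literally the per-sample average NLL, and because a one-hot target distribution has zero entropy, it also equals the average per-sample KL divergence.

```python
import numpy as np

y = np.array([0, 2, 1, 0])                       # observed class labels (toy data)
probs = np.array([[0.70, 0.20, 0.10],            # model probabilities Q(y | x), one row per example
                  [0.10, 0.30, 0.60],
                  [0.20, 0.50, 0.30],
                  [0.90, 0.05, 0.05]])

# Average negative log likelihood of the observed labels under the model.
nll = -np.mean(np.log(probs[np.arange(len(y)), y]))

# Cross-entropy empirical risk, computed the way classification losses usually are.
onehot = np.eye(3)[y]
ce = -np.mean(np.sum(onehot * np.log(probs), axis=1))

print(nll, ce)  # identical values: ER under the CE loss == average NLL
# With one-hot targets, H(p) = 0, so CE(p, Q) also equals KL(p || Q) for each sample.
```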

Is the above line of reasoning correct? Or do people e.g. use cross-entropy and KL divergence for problems other than classification? Also, does the "CE ≡ KL ≡ NLL" equivalence relationship (in terms of optimization solutions) always hold?

In either case, what is minimized in practice directly (KL vs the CE) and in what circumstances?


Motivation

Consider the following from a question on this site:

"The KL divergence can depart into a Cross-Entropy of p and q (the first part), and a global entropy of ground truth p (the second part).

[From the comments] In my own experience … BCE is way more robust than KL. Basically, KL was unusable. KL and BCE aren't 'equivalent' loss functions."

I have read similar statements online: that these two quantities are not the same, and that in practice we use one or the other for optimization. Is that actually the case? If so, which quantity is actually evaluated and optimized directly in practice, for what types of problems, and why?


Best Answer

Let $q$ be the density of your true data-generating process and $f_\theta$ be your model-density.

Then $$KL(q||f_\theta) = \int q(x) \log\left(\frac{q(x)}{f_\theta(x)}\right)dx = -\int q(x) \log(f_\theta(x))\,dx + \int q(x) \log(q(x))\,dx$$

The first term is the cross-entropy $H(q, f_\theta)$ and the second term is the negative (differential) entropy, $-H(q)$. Note that the second term does NOT depend on $\theta$, so you cannot influence it in any way. Therefore, minimizing the cross-entropy and minimizing the KL divergence are equivalent.
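
As a quick numeric check of this decomposition (the two discrete distributions below are made up for illustration, not from the original post), the KL divergence and the cross-entropy differ only by the $\theta$-free entropy term, so minimizing one minimizes the other:

```python
import numpy as np

q = np.array([0.5, 0.3, 0.2])      # "true" distribution q
f = np.array([0.4, 0.4, 0.2])      # model distribution f_theta

kl = np.sum(q * np.log(q / f))     # KL(q || f)
ce = -np.sum(q * np.log(f))        # cross-entropy H(q, f)
h  = -np.sum(q * np.log(q))        # entropy H(q), independent of the model

print(kl, ce - h)                  # equal: KL(q || f) = H(q, f) - H(q)
```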

Without looking at the formula, you can understand it in the following informal way (assuming a discrete distribution). The entropy $H(q)$ is the number of bits you need on average if you encode the signal that comes from the distribution $q$ in an optimal way. The cross-entropy $H(q, f_\theta)$ is the number of bits you need on average if you encode the signal that comes from the distribution $q$ using the optimal coding scheme for $f_\theta$. This decomposes into the entropy $H(q)$ plus $KL(q||f_\theta)$. The KL divergence therefore measures how many additional bits you need when you use the optimal coding scheme for $f_\theta$ (i.e. you assume your data come from $f_\theta$ while they are actually generated from $q$). This also explains why it cannot be negative: you cannot beat the optimal coding scheme, which achieves the average bit-length $H(q)$.
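
The coding picture can be made concrete with base-2 logarithms and two made-up discrete distributions (a rough sketch, not part of the original answer):

```python
import numpy as np

q = np.array([0.50, 0.25, 0.25])   # true source distribution
f = np.array([0.25, 0.25, 0.50])   # distribution the code was optimized for

h_q  = -np.sum(q * np.log2(q))     # optimal average code length: 1.5 bits/symbol
h_qf = -np.sum(q * np.log2(f))     # average length with the mismatched code: 1.75 bits/symbol
kl   = np.sum(q * np.log2(q / f))  # extra bits paid for the mismatch: 0.25 bits/symbol

print(h_qf, h_q + kl)              # H(q, f) = H(q) + KL(q || f), and KL >= 0
```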

This illustrates informally why minimizing the KL divergence is equivalent to minimizing the CE: by minimizing how many more bits you need than the optimal coding scheme (on average), you of course also minimize the total number of bits you need (on average).

The following post illustrates the idea with the optimal coding scheme: Qualitively what is Cross Entropy
