Cross-Entropy Regression – Using Cross-Entropy for Regression Problems

classification, cross-entropy, kullback-leibler, regression

I usually see the following loss functions discussed in the context of these types of problems:

  • Cross entropy loss (KL divergence) for classification problems
  • MSE for regression problems

However, my understanding (see here) is that maximum likelihood estimation (MLE) is equivalent to minimizing the negative log likelihood (NLL), which is equivalent to minimizing the KL divergence and thus the cross entropy.
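
Sketching the chain I have in mind (with $\hat q$ denoting the empirical distribution of the data and $f_\theta$ the model density):

$$\arg\max_\theta \prod_{i=1}^n f_\theta(y_i \mid x_i) \;=\; \arg\min_\theta \underbrace{-\frac{1}{n}\sum_{i=1}^n \log f_\theta(y_i \mid x_i)}_{\text{NLL} \;=\; H(\hat q,\, f_\theta)} \;=\; \arg\min_\theta \mathrm{KL}(\hat q \,\|\, f_\theta),$$

since $H(\hat q, f_\theta) = H(\hat q) + \mathrm{KL}(\hat q \,\|\, f_\theta)$ and $H(\hat q)$ does not depend on $\theta$.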

So:

  • Why isn't KL or CE also used for regression problems?
  • What's the relationship between CE and MSE for regression? Are they one and the same loss under some circumstances?
  • If different, what's the benefit of using MSE for regression instead?

Best Answer

In a regression problem you have pairs $(x_i, y_i)$, and some true model $q$ that characterizes the conditional density $q(y \mid x)$. Let's say you assume that your model density is

$$f_\theta(y_i \mid x_i)= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}\bigl(y_i-\mu_\theta(x_i)\bigr)^2\right\}$$

and that you fix $\sigma^2$ to some value.

The mean $\mu_\theta(x_i)$ is then modelled via, e.g., a neural network (or any other model).
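
To make the setup concrete, here is a minimal NumPy sketch (the linear $\mu_\theta$ and the function names are placeholders for this example only): it evaluates the assumed Gaussian density and its negative log, i.e. the per-observation term that gets summed into the cross entropy below.

```python
import numpy as np

def mu_theta(x, theta=(0.5, 1.0)):
    # Placeholder for the model of the conditional mean; in practice this
    # would be a neural network (or any other regression model).
    a, b = theta
    return a + b * x

def f_theta(y, x, sigma2=1.0):
    # Assumed Gaussian density f_theta(y_i | x_i) with fixed variance sigma^2.
    return np.exp(-(y - mu_theta(x)) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def neg_log_f_theta(y, x, sigma2=1.0):
    # -log f_theta(y_i | x_i): the term summed to form the empirical cross entropy.
    return 0.5 * np.log(2.0 * np.pi * sigma2) + (y - mu_theta(x)) ** 2 / (2.0 * sigma2)

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.4, 1.6, 2.4])
print(f_theta(y, x))           # density values under the assumed model
print(neg_log_f_theta(y, x))   # per-observation negative log density
```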

Writing the empirical approximation to the cross entropy, you get:

$$\sum_{i = 1}^n-\log\left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2\right\} \right)$$

$$=\sum_{i = 1}^n\left[-\log\left( \frac{1}{\sqrt{2\pi\sigma^2}}\right) +\frac{1}{2\sigma^2}(y_i-\mu_\theta(x_i))^2\right]$$

If we e.g. set $\sigma^2 = 1$ (i.e. we assume the variance is known; we could also model the variance, in which case our neural network would have two outputs, one for the mean and one for the variance), we get:

$$=\sum_{i = 1}^n\left[-\log\left( \frac{1}{\sqrt{2\pi}}\right) +\frac{1}{2}(y_i-\mu_\theta(x_i))^2\right]$$

Minimizing this is equivalent to minimizing the $L_2$ loss, since the first term is a constant that does not depend on $\theta$.

So we have seen that minimizing CE under the assumption of normality (with fixed variance) is equivalent to minimizing the $L_2$ loss.
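
As a quick numerical check (a small NumPy sketch with illustrative names, under the same assumptions as above): with $\sigma^2 = 1$, the empirical cross entropy and half the sum of squared errors differ only by the constant $\tfrac{n}{2}\log(2\pi)$, so they have the same gradients with respect to the predictions and the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(size=50)

def gaussian_ce(y, mu):
    # Empirical cross entropy under N(mu, 1): sum of negative log densities.
    return np.sum(0.5 * np.log(2.0 * np.pi) + 0.5 * (y - mu) ** 2)

def half_sse(y, mu):
    # Half the sum of squared errors (the L2 part of the objective).
    return 0.5 * np.sum((y - mu) ** 2)

# The difference is the same constant n * 0.5 * log(2*pi) for any prediction mu,
# so both objectives differ only by an additive constant in mu.
for mu in (0.0, 0.3, 1.7):
    print(gaussian_ce(y, mu) - half_sse(y, mu))  # ~45.95 every time (50 * 0.5 * log(2*pi))
```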