Neural Networks – How to Construct a Cross-Entropy Loss for General Regression Targets

It's common shorthand in the neural networks literature to refer to categorical cross-entropy loss as simply "cross-entropy." However, this terminology is ambiguous because different probability distributions have different cross-entropy loss functions.

So, in general, how does one move from an assumed probability distribution for the target variable to defining a cross-entropy loss for your network? What does the function require as inputs? (For example, the categorical cross-entropy function for one-hot targets requires a one-hot binary vector and a probability vector as inputs.)

A good answer will discuss the general principles involved, as well as worked examples for

  • categorical cross-entropy loss for one-hot targets
  • Gaussian-distributed targets, and how this reduces to the usual MSE loss
  • A less common example such as a gamma distributed target, or a heavy-tailed target
  • Explain the relationship between minimizing cross entropy and maximizing log-likelihood.

Best Answer

Suppose that we are trying to infer the parametric distribution $p(y|\Theta(X))$, where $\Theta(X)$ is a vector-valued inverse link function with components $[\theta_1,\theta_2,...,\theta_M]$.

We have a neural network at hand with some topology we have chosen. The number of outputs at the output layer matches the number of parameters we would like to infer (it may be fewer if we don't care about all the parameters, as we will see in the examples below).

In the hidden layers we may use whatever activation function we like. What's crucial are the output activation functions for each parameter as they have to be compatible with the support of the parameters.

Some example correspondence:

  • Linear activation: $\mu$, mean of Gaussian distribution
  • Logistic activation: $\mu$, mean of Bernoulli distribution
  • Softplus activation: $\sigma$, standard deviation of Gaussian distribution, shape parameters of Gamma distribution
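
As a rough illustration of the correspondence above, the three output activations can be sketched in plain Python. The function names are my own choices for this sketch and are not taken from any particular toolbox; each function maps an unconstrained raw output $z$ onto the support of the parameter it is meant to represent.

```python
import math

def linear(z):
    """Identity: the output can be any real number (e.g. a Gaussian mean)."""
    return z

def logistic(z):
    """Maps any real z into (0, 1) (e.g. a Bernoulli mean)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # branch avoids overflow for very negative z
    return e / (1.0 + e)

def softplus(z):
    """log(1 + exp(z)), rearranged to avoid overflow for large |z|.

    Mathematically positive (e.g. a Gaussian sigma or a Gamma shape);
    in floating point it can underflow to 0.0 for extremely negative z.
    """
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical raw network outputs, just to show the ranges.
for z in (-3.0, 0.0, 2.5):
    print(z, linear(z), logistic(z), softplus(z))
```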

Definition of cross entropy:

$$H(p,q) = -E_p[\log q(y)] = -\int p(y) \log q(y) dy$$

where $p$ is the true (ideal) distribution of the data, and $q$ is our model.

Empirical estimate:

$$H(p,q) \approx -\frac{1}{N}\sum_{i=1}^N \log q(y_i)$$

where $N$ is number of independent data points coming from $p$.

Version for conditional distribution:

$$H(p,q) \approx -\frac{1}{N}\sum_{i=1}^N \log q(y_i|\Theta(X_i))$$

Now suppose that the network output is $\Theta(W,X_i)$ for a given input vector $X_i$ and all network weights $W$. Training then minimizes the empirical cross entropy over the weights:

$$W_{opt} = \arg \min_W -\frac{1}{N}\sum_{i=1}^N \log q(y_i|\Theta(W,X_i))$$

which is equivalent to Maximum Likelihood Estimation of the network parameters.
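
The equivalence can be checked on a toy problem. The sketch below is only an illustration under simplifying assumptions: the "network" is a single weight $w$ that is used directly as the mean of a unit-variance Gaussian, the data are made up, and the optimization is a brute-force search over a grid of candidate weights.

```python
import math

def gaussian_log_density(y, mu, sigma=1.0):
    """log q(y | mu, sigma) for a Gaussian model."""
    return (-math.log(sigma * math.sqrt(2.0 * math.pi))
            - (y - mu) ** 2 / (2.0 * sigma ** 2))

def mean_nll(targets, w):
    """Empirical cross entropy: -(1/N) * sum_i log q(y_i | w)."""
    return -sum(gaussian_log_density(y, w) for y in targets) / len(targets)

def likelihood(targets, w):
    """Joint likelihood of independent observations: prod_i q(y_i | w)."""
    return math.prod(math.exp(gaussian_log_density(y, w)) for y in targets)

targets = [0.8, 1.1, 1.4, 0.7]                 # made-up observations
candidates = [i / 100.0 for i in range(201)]   # grid of weights in [0, 2]

w_min_nll = min(candidates, key=lambda w: mean_nll(targets, w))
w_max_lik = max(candidates, key=lambda w: likelihood(targets, w))
print(w_min_nll, w_max_lik)
```

Both searches select the same weight, because the negative log is a monotone decreasing transformation of the likelihood and dividing by $N$ does not move the optimum.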

Some examples:

  • Regression: Gaussian response (heteroscedastic)

$$\mu = \theta_1 : \text{linear activation}$$

$$\sigma = \theta_2: \text{softplus activation*}$$

$$\text{loss} = -\frac{1}{N}\sum_{i=1}^N \log \left[\frac{1} {\theta_2(W,X_i)\sqrt{2\pi}}e^{-\frac{(y_i-\theta_1(W,X_i))^2}{2\theta_2(W,X_i)^2}}\right]$$

Under homoscedasticity, $\theta_2$ is a constant that does not affect the optimization, so we don't need it. After discarding irrelevant constants, the expression simplifies to:

$$\text{loss} = \frac{1}{N}\sum_{i=1}^N (y_i-\theta_1(W,X_i))^2$$
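
The reduction to MSE can be checked numerically. In this sketch (made-up targets and predictions, and an arbitrarily chosen fixed $\sigma$), the Gaussian negative log-likelihood equals the MSE scaled by $1/(2\sigma^2)$ plus the constant $\log(\sigma\sqrt{2\pi})$, so both losses rank any two sets of predictions the same way.

```python
import math

def gaussian_nll(targets, means, sigma):
    """Mean negative log-likelihood under a Gaussian with a shared sigma."""
    total = 0.0
    for y, mu in zip(targets, means):
        total += (math.log(sigma * math.sqrt(2.0 * math.pi))
                  + (y - mu) ** 2 / (2.0 * sigma ** 2))
    return total / len(targets)

def mse(targets, means):
    """Mean squared error between targets and predicted means."""
    return sum((y - mu) ** 2 for y, mu in zip(targets, means)) / len(targets)

targets = [1.0, 2.0, 3.0]     # made-up data
means_a = [1.1, 1.9, 3.2]     # hypothetical predictions from one set of weights
means_b = [0.5, 2.5, 2.0]     # hypothetical predictions from another set
sigma = 0.7                   # fixed (homoscedastic) standard deviation

const = math.log(sigma * math.sqrt(2.0 * math.pi))
for means in (means_a, means_b):
    lhs = gaussian_nll(targets, means, sigma)
    rhs = mse(targets, means) / (2.0 * sigma ** 2) + const
    print(lhs, rhs)           # the two columns agree up to rounding
```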

  • Binary classification: Bernoulli response

$$\mu = \theta_1 : \text{logistic activation}$$

$$\text{loss} = -\frac{1}{N}\sum_{i=1}^N \log \left[\theta_1(W,X_i)^{y_i}(1-\theta_1(W,X_i))^{(1-y_i)}\right]$$

$$= -\frac{1}{N}\sum_{i=1}^N \left[ y_i\log \theta_1(W,X_i) + (1-y_i)\log (1-\theta_1(W,X_i)) \right]$$

with $y_i \in \{0,1\}$.
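
A minimal sketch of this loss in plain Python follows. The clipping constant `eps` is an assumption added here to avoid `log(0)`; it is not part of the formula above.

```python
import math

def binary_cross_entropy(targets, probs, eps=1e-12):
    """Mean Bernoulli negative log-likelihood.

    targets: labels y_i in {0, 1}
    probs:   predicted means theta_1(W, X_i) in (0, 1)
    eps:     clipping constant (an assumption of this sketch)
    """
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)   # keep the logs finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(targets)

# Made-up labels and predicted probabilities.
print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6]))
```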

  • Regression: Gamma response

$$\alpha \text{ (shape)} = \theta_1 : \text{softplus activation*}$$

$$\beta \text{ (rate)} = \theta_2: \text{softplus activation*}$$

$$\text{loss} = -\frac{1}{N}\sum_{i=1}^N \log [\frac{\theta_2(W,X_i)^{\theta_1(W,X_i)}}{\Gamma(\theta_1(W,X_i))} y_i^{\theta_1(W,X_i)-1}e^{-\theta_2(W,X_i)y_i}]$$
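
In practice one would evaluate the log of the Gamma density term by term rather than forming the density and then taking its log. The sketch below does this with `math.lgamma` from the Python standard library; the function name and the example numbers are made up for illustration.

```python
import math

def gamma_nll(targets, shapes, rates):
    """Mean negative log-likelihood of a Gamma(shape, rate) model.

    targets: positive observations y_i
    shapes:  theta_1(W, X_i) > 0, one per observation
    rates:   theta_2(W, X_i) > 0, one per observation
    """
    total = 0.0
    for y, a, b in zip(targets, shapes, rates):
        # log[ b^a / Gamma(a) * y^(a-1) * exp(-b*y) ], expanded term by term
        total += (a * math.log(b) - math.lgamma(a)
                  + (a - 1.0) * math.log(y) - b * y)
    return -total / len(targets)

# Made-up targets and per-observation parameters.
print(gamma_nll([0.5, 2.0, 1.3], [2.0, 3.0, 1.5], [1.0, 1.5, 0.8]))
```

Working in log space keeps the computation stable when the shape is large, where $\Gamma(\theta_1)$ itself would overflow.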

  • Multiclass classification: categorical response

Some constraints cannot be handled directly by plain vanilla neural network toolboxes (although these days they seem to offer very advanced tricks). This is one of those cases:

$$\mu_1 = \theta_1 : \text{logistic activation}$$

$$\mu_2 = \theta_2 : \text{logistic activation}$$

...

$$\mu_K = \theta_K : \text{logistic activation}$$

We have the constraint $\sum_{j=1}^K \theta_j = 1$, so we normalize the outputs before we plug them into the distribution:

$$\theta_j' = \frac{\theta_j}{\sum_{k=1}^K \theta_k}$$

$$\text{loss} = -\frac{1}{N}\sum_{i=1}^N \log \left[\prod_{j=1}^K\theta_j'(W,X_i)^{y_{i,j}}\right]$$

Note that each $y_i$ is a one-hot vector in this case. Another approach is to use a softmax output layer, which satisfies the sum-to-one constraint by construction.
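
Both ways of satisfying the constraint, together with the resulting loss, can be sketched as follows. The helper names are my own, and the inputs are made up; the softmax subtracts the maximum logit, a common stabilization trick that does not change the result.

```python
import math

def normalize(thetas):
    """Rescale positive outputs (e.g. logistic activations) to sum to 1."""
    s = sum(thetas)
    return [t / s for t in thetas]

def softmax(logits):
    """Alternative: exponentiate raw outputs and normalize."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def categorical_cross_entropy(one_hot_targets, prob_vectors):
    """Mean negative log-likelihood for one-hot targets."""
    total = 0.0
    for y_vec, p_vec in zip(one_hot_targets, prob_vectors):
        # Only the class with y = 1 contributes to the log of the product.
        total += sum(y * math.log(p) for y, p in zip(y_vec, p_vec) if y > 0)
    return -total / len(one_hot_targets)

targets = [[0, 1, 0], [1, 0, 0]]                                  # made-up one-hot labels
probs = [normalize([0.3, 0.9, 0.2]), softmax([2.0, 0.5, -1.0])]   # hypothetical outputs
print(categorical_cross_entropy(targets, probs))
```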

*ReLU is unfortunately not a particularly good activation function for parameters with support $(0,\infty)$, for two reasons. First, its derivative is zero for all negative inputs, which can cause gradient-based optimization algorithms to get trapped. Second, it can output exactly 0, a parameter value at which many distributions become singular. For this reason it is common practice to add a small value $\epsilon$ to the output, both to assist off-the-shelf optimizers and for numerical stability.

As suggested by @Sycorax, softplus activation is a much better replacement, as it doesn't have a dead derivative zone.
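
The difference in gradients is easy to see numerically. In this small sketch (function names are my own), the derivative of softplus is the logistic function, which is strictly positive for moderate inputs, whereas the ReLU derivative is exactly zero for every negative input.

```python
import math

def relu_grad(z):
    """Derivative of max(0, z): zero for all negative inputs."""
    return 1.0 if z > 0 else 0.0

def softplus_grad(z):
    """Derivative of log(1 + exp(z)), which is the logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # branch avoids overflow for very negative z
    return e / (1.0 + e)

for z in (-5.0, -1.0, 0.5):
    print(z, relu_grad(z), softplus_grad(z))
```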

Summary:

  1. Plug the network outputs into the parameters of the distribution, take the negative log, and minimize over the network weights.
  2. This is equivalent to Maximum Likelihood Estimation of the parameters.