How meaningful is the connection between MLE and cross entropy in deep learning?

cross entropy, deep learning, maximum likelihood

I understand that given a set of $m$ independent observations
$\mathbb{O}=\{\mathbf{o}^{(1)}, \ldots, \mathbf{o}^{(m)}\}$
the Maximum Likelihood Estimator (or, equivalently, the MAP with flat/uniform prior) that identifies the parameters $\mathbf{θ}$ that produce the model distribution $p_{model}\left(\,\cdot\, ; \mathbf{θ}\right)$
that best matches those observations will be

$$\mathbf{θ}_{ML}(\mathbb{O})= \underset{\mathbf{θ}}{\arg\max}\, p_{model}\left(\mathbb{O}; \mathbf{θ}\right) = \underset{\mathbf{θ}}{\arg\max}\prod_{i=1}^{m} p_{model}\left(\mathbf{o}^{(i)}; \mathbf{θ}\right)$$

or, more conveniently

$$\mathbf{θ}_{ML}(\mathbb{O})= \underset{\mathbf{θ}}{\arg\min}\sum_{i=1}^{m} -\log p_{model}\left(\mathbf{o}^{(i)}; \mathbf{θ}\right)$$

and see the role that $\mathbf{θ}_{ML}$ can play in defining a loss function for multi-class deep neural networks, in which $\mathbf{θ}$ corresponds to the network's trainable parameters (e.g., $\mathbf{θ} = \{\mathbf{W}, \mathbf{b}\}$) and the observations are the pairs of input activations $\mathbf{x}$ and corresponding correct class labels $y \in [1, k]$, i.e. $\mathbf{o}^{(i)} = \{\mathbf{x}^{(i)}, y^{(i)}\}$, by taking

$$p_{model}\left(\mathbf{o}^{(i)}; \mathbf{θ}\right) \equiv p_{model}\left(y^{(i)} | \mathbf{x}^{(i)}; \mathbf{θ}\right)$$

What I don't understand is how this relates to the so called "cross entropy" of the (vectorized) correct output, $\mathbf{y}^{(i)}$, and the corresponding output activations of the network, $\mathbf{a}(\mathbf{x}^{(i)}; \mathbf{θ})$
$$H(\mathbf{o}^{(i)}; \mathbf{θ}) = -\mathbf{y}^{(i)}\cdot \mathbf{log}\,\mathbf{a}(\mathbf{x}^{(i)}; \mathbf{θ})$$
that is used in practice when measuring error/loss during training. There are several related issues:

Activations "as probabilities"

One of the steps in establishing the relationship between MLE and cross entropy is to use the output activations "as if" they are probabilities. But it's not clear to me that they are, or at least that they *all* are.

In calculating training error — specifically, in calling it a "cross entropy loss" — it is assumed that (after normalizing activations to sum to 1)

$$p_{model}\left(\mathbf{o}^{(i)}; \mathbf{θ}\right) \equiv a_{y^{(i)}}(\mathbf{x}^{(i)}; \mathbf{θ})\tag{1}\label{1}$$

or

$$\log p_{model}\left(\mathbf{o}^{(i)}; \mathbf{θ}\right) = \log a_{y^{(i)}}(\mathbf{x}^{(i)}; \mathbf{θ})$$

so that we can write

$$-\log p_{model}\left(\mathbf{o}^{(i)}; \mathbf{θ}\right) = -\mathbf{y}^{(i)}\cdot \mathbf{log}\,\mathbf{a}(\mathbf{x}^{(i)}; \mathbf{θ})\tag{3}\label{3}$$

and thus

$$\mathbf{θ}_{ML}(\mathbb{O})=\underset{\mathbf{θ}}{\arg\min}\sum_{i=1}^{m} H(\mathbf{o}^{(i)}; \mathbf{θ})$$

But while this certainly makes $a_{y^{(i)}}(\mathbf{x}^{(i)}; \mathbf{θ}_{ML})$ a probability (to the extent that anything is), it places no restrictions on the other activations.

Can the $\mathbf{a}(\mathbf{x}^{(i)}; \mathbf{θ}_{ML})$ really be said to be PMFs in that case? Is there anything that makes the $a_{y^{(i)}}(\mathbf{x}^{(i)}; \mathbf{θ}_{ML})$ not in fact probabilities (and merely "like" them)?
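To make the one-hot selection step in $\eqref{3}$ concrete, here is a minimal NumPy sketch (the activation values are invented for illustration):

```python
import numpy as np

# Hypothetical softmax outputs for one example with k = 4 classes (values invented).
a = np.array([0.10, 0.65, 0.20, 0.05])   # nonnegative and sum to 1
y = np.array([0.0, 1.0, 0.0, 0.0])       # one-hot encoding of the correct class (index 1)

# The dot product with the one-hot vector ...
cross_entropy = -np.dot(y, np.log(a))
# ... just picks out the negative log of the activation at the label position.
neg_log_label = -np.log(a[1])

assert np.isclose(cross_entropy, neg_log_label)   # both are about 0.431
```

The one-hot $\mathbf{y}^{(i)}$ does nothing except select the label's activation, which is exactly the structure the next section asks about.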

Limitation to categorization

The crucial step above in equating MLE with cross-entropy relies entirely on the "one-hot" structure of $\mathbf{y}^{(i)}$ that characterizes a (single-label) multi-class learning problem. Any other structure for the $\mathbf{y}^{(i)}$ would make it impossible to get from $\eqref{1}$ to $\eqref{3}$.

Is the equation of MLE and cross-entropy minimization limited to cases where the $\mathbf{y}^{(i)}$ are "one-hot"?

Different training and prediction probabilities

During prediction, it is almost always the case that

$$p_{model}\left(y^{(i)} | \mathbf{x}^{(i)}; \mathbf{θ}\right) \equiv P\left(\underset{j\in[1,k]}{\arg\max}\,a_j(\mathbf{x}^{(i)}; \mathbf{θ}) = y^{(i)}\right)\tag{2}\label{2}$$

which results in correct prediction probabilities that are different from the probabilities learned during training unless it is reliably the case that

$$a_{y^{(i)}}(\mathbf{x}^{(i)}; \mathbf{θ}_{ML}) = P\left(\underset{j\in[1,k]}{\arg\max}\,a_j(\mathbf{x}^{(i)}; \mathbf{θ}_{ML}) = y^{(i)}\right)$$

Is this ever reliably the case? Is it likely at least approximately true? Or is there some other argument that justifies this equation of the value of the learned activation at the label position with the probability that the maximum value of learned activations occurs there?

Entropy and information theory

Even assuming that the above concerns are addressed and the activations are valid PMFs (or can meaningfully be treated as such), so that the role played by cross entropy in computing $\mathbf{θ}_{ML}$ is unproblematic, it's not clear to me why it is helpful or meaningful to talk about the entropy of the $\mathbf{a}(\mathbf{x}^{(i)}; \mathbf{θ}_{ML})$, since Shannon entropy applies to a specific kind of encoding, which is not the one being used in training the network.

What role does information theoretic entropy play in interpreting the cost function, as opposed to simply providing a tool (in the form of cross entropy) for computing one (that corresponds to MLE)?

Best Answer

Neural nets don't necessarily give probabilities as outputs, but they can be designed to do this. To be interpreted as probabilities, a set of values must be nonnegative and sum to one. Designing a network to output probabilities typically amounts to choosing an output layer that imposes these constraints. For example, in a classification problem with $k$ classes, a common choice is a softmax output layer with $k$ units. The softmax function forces the outputs to be nonnegative and sum to one. The $j$th output unit gives the probability that the class is $j$. For binary classification problems, another popular choice is to use a single output unit with logistic activation function. The output of the logistic function is between zero and one, and gives the probability that the class is 1. The probability that the class is 0 is implicitly one minus this value. If the network contains no hidden layers, then these two examples are equivalent to multinomial logistic regression and logistic regression, respectively.
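As a minimal sketch of the softmax case (plain NumPy, not tied to any particular framework; the logit values are invented for illustration):

```python
import numpy as np

def softmax(z):
    """Map a vector of real-valued scores (logits) to a probability vector."""
    z = z - np.max(z)        # shift for numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

# Hypothetical pre-activations from the network's last linear layer, k = 3 classes.
logits = np.array([2.0, -1.0, 0.5])
probs = softmax(logits)

print(probs)          # roughly [0.79, 0.04, 0.18]
print(probs.sum())    # 1.0, so the outputs can be read as class probabilities
```

A single logistic output unit can be viewed as the $k = 2$ version of the same construction, with one of the two logits fixed at zero.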

Cross entropy $H(p, q)$ measures the difference between two probability distributions $p$ and $q$. When cross entropy is used as a loss function for discriminative classifiers, $p$ and $q$ are distributions over class labels, given the input (i.e. a particular data point). $p$ is the 'true' distribution and $q$ is the distribution predicted by the model. In typical classification problems, each input in the dataset is associated with an integer label representing the true class. In this case, we use the empirical distribution for $p$. This simply assigns probability 1 to the true class of a data point, and probability 0 to all other classes. $q$ is the distribution of class probabilities predicted by the network (e.g. as described above).
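Written out for discrete distributions over the $k$ class labels (the only case needed here),

$$H(p, q) = -\sum_{j=1}^{k} p_j \log q_j$$

and for a data point whose true class is $y$, the empirical distribution sets $p_y = 1$ and $p_j = 0$ for all $j \neq y$.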

Say the data are i.i.d., $p_i$ is the empirical distribution, and $q_i$ is the predicted distribution (for the $i$th data point). Then, minimizing the cross entropy loss (i.e. $H(p_i, q_i)$ averaged over data points) is equivalent to maximizing the likelihood of the data. The proof is relatively straightforward. The basic idea is to show that the cross entropy loss is proportional to a sum of negative log predicted probabilities of the data points. This falls out neatly because of the form of the empirical distribution.
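Spelled out, write $p_{i,j}$ and $q_{i,j}$ for the probability each distribution assigns to class $j$ at the $i$th data point, and $y^{(i)}$ for that point's true class. Because $p_i$ is one-hot, only the $j = y^{(i)}$ term survives in each inner sum:

$$\sum_{i=1}^{m} H(p_i, q_i) = -\sum_{i=1}^{m}\sum_{j=1}^{k} p_{i,j}\log q_{i,j} = -\sum_{i=1}^{m} \log q_{i,\,y^{(i)}}$$

This is exactly the negative log-likelihood from the question's second display (averaging instead of summing only rescales by the constant $1/m$), so minimizing the cross entropy loss and maximizing the likelihood select the same parameters.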

Cross entropy loss can also be applied more generally. For example, in 'soft classification' problems, we're given distributions over class labels rather than hard class labels (so we don't use the empirical distribution). I describe how to use cross entropy loss in that case here.
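For instance (a toy NumPy illustration with invented numbers), nothing in the cross entropy formula requires $p$ to be one-hot:

```python
import numpy as np

# Soft targets: a distribution over classes instead of a single hard label.
p = np.array([0.7, 0.2, 0.1])   # 'true' label distribution for one example
q = np.array([0.6, 0.3, 0.1])   # distribution predicted by the model

loss = -np.sum(p * np.log(q))   # cross entropy H(p, q); no one-hot structure needed
print(loss)                     # about 0.83
```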

To address some other specifics in your question:

Different training and prediction probabilities

It looks like you're finding the output unit with maximum activation and comparing this to the class label. This isn't done for training using the cross entropy loss. Instead, the probabilities output by the model are compared to the 'true' probabilities (typically taken to be the empirical distribution).
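For instance (a toy NumPy comparison with invented numbers), two predicted distributions with the same argmax, and therefore the same hard classification, can have very different cross entropy losses:

```python
import numpy as np

y = 0                                        # index of the true class

q_confident = np.array([0.90, 0.05, 0.05])   # correct and confident
q_hesitant  = np.array([0.40, 0.35, 0.25])   # same argmax, much less confident

for q in (q_confident, q_hesitant):
    print(np.argmax(q) == y, -np.log(q[y]))
# True 0.105...  vs.  True 0.916... : identical hard predictions, very different losses
```

The loss sees the whole predicted distribution, not just which output unit happens to be largest.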

Shannon entropy applies to a specific kind of encoding, which is not the one being used in training the network.

Cross entropy $H(p,q)$ can be interpreted as the number of bits per message needed (on average) to encode events drawn from true distribution $p$, if using an optimal code for distribution $q$. Cross entropy takes a minimum value of $H(p)$ (the Shannon entropy of $p$) when $q = p$. The better the match between $q$ and $p$, the shorter the message length. Training a model to minimize the cross entropy can be seen as training it to better approximate the true distribution. In supervised learning problems like we've been discussing, the model gives a probability distribution over possible outputs, given the input. Explicitly finding optimal codes for the distribution isn't part of the process.
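One way to state this relationship (a standard identity, not anything specific to neural networks) is

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q) \;\geq\; H(p)$$

so with $p$ held fixed, minimizing the cross entropy is the same as minimizing the KL divergence from $p$ to $q$, and the minimum value $H(p)$ is reached exactly when $q = p$.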
