Solved – the relationship between maximizing the likelihood and minimizing the cross-entropy

cross-entropy, machine-learning, mathematical-statistics, maximum-likelihood

It is often stated that maximizing the likelihood is equivalent to minimizing the cross-entropy. Is there a proof of this statement?

Best Answer

Here's a worked example in the case of iid binary data, each with a success/failure recorded as $y_i \in \{0,1\}$.

For labels $y_i\in \{0,1\}$, the likelihood of some binary data under the Bernoulli model with parameters $\theta$ is $$ \mathcal{L}(\theta) = \prod_{i=1}^n p(y_i=1\mid\theta)^{y_i}\,p(y_i=0\mid\theta)^{1-y_i} $$ whereas the log-likelihood is $$ \log\mathcal{L}(\theta) = \sum_{i=1}^n \left[ y_i\log p(y_i=1\mid\theta) + (1-y_i)\log p(y_i=0\mid\theta) \right] $$
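As a quick sanity check, here is a minimal numerical sketch of these two quantities, assuming the simplest parameterization in which $\theta$ is itself the success probability, i.e. $p(y_i=1\mid\theta)=\theta$ (the data vector is made up for illustration):

```python
import numpy as np

# Hypothetical setup: theta is the single Bernoulli success probability,
# so p(y=1 | theta) = theta and p(y=0 | theta) = 1 - theta.
y = np.array([1, 0, 1, 1, 0, 1])   # made-up binary labels
theta = 0.6

# Likelihood: product of per-observation probabilities
likelihood = np.prod(theta**y * (1 - theta)**(1 - y))

# Log-likelihood: sum of per-observation log-probabilities
log_likelihood = np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

print(np.isclose(np.log(likelihood), log_likelihood))  # True
```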

And the binary cross-entropy is $$ L(\theta) = -\frac{1}{n}\sum_{i=1}^n \left[ y_i\log p(y_i=1\mid\theta) + (1-y_i)\log p(y_i=0\mid\theta) \right] $$

Clearly, $ \log \mathcal{L}(\theta) = -nL(\theta) $.
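Under the same hypothetical parameterization as the sketch above, this identity can be checked numerically:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1])   # made-up binary labels
n = len(y)
theta = 0.6                        # assuming p(y=1 | theta) = theta

# Log-likelihood (a sum) and binary cross-entropy (a negative mean)
log_likelihood = np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))
cross_entropy = -np.mean(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# log L(theta) = -n * L(theta)
print(np.isclose(log_likelihood, -n * cross_entropy))  # True
```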

We know that an optimal parameter vector $\theta^*$ is the same for both because, for any $\theta$ which is not optimal, we have $L(\theta) > L(\theta^*)$, and multiplying both sides by any positive constant (such as $n$) preserves the inequality, so $nL(\theta) > nL(\theta^*)$ as well. Negating both sides then flips the inequality, giving $\log\mathcal{L}(\theta) = -nL(\theta) < -nL(\theta^*) = \log\mathcal{L}(\theta^*)$: the minimizer of the cross-entropy is exactly the maximizer of the log-likelihood. (Remember, we want to minimize cross-entropy, so the optimal $\theta^*$ has the smallest $L(\theta^*)$.)

Likewise, we know that the optimal value $\theta^*$ is the same for $\log \mathcal{L}(\theta)$ and $\mathcal{L}(\theta)$ because $\log(x)$ is a monotonically increasing function for $x \in \mathbb{R}^+$, so $\mathcal{L}(\theta) < \mathcal{L}(\theta^*)$ implies $\log \mathcal{L}(\theta) < \log\mathcal{L}(\theta^*)$ and vice versa. (Remember, we want to maximize the likelihood, so the optimal $\theta^*$ has the largest $\mathcal{L}(\theta^*)$.)

Some sources omit the $\frac{1}{n}$ from the cross-entropy. Clearly, this only changes the value of $L(\theta)$, not the location of the optimum, so from an optimization perspective the distinction is not important. The negative sign, however, is important, since it is the difference between maximizing and minimizing!
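A small grid-search sketch (same hypothetical parameterization and made-up data as above) illustrates that the maximizer of the log-likelihood and the minimizer of the cross-entropy land on the same parameter value:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1])        # made-up binary labels
thetas = np.linspace(0.01, 0.99, 99)    # grid of candidate parameters

def log_lik(theta):
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

def cross_ent(theta):
    return -np.mean(y * np.log(theta) + (1 - y) * np.log(1 - theta))

best_by_max_loglik = thetas[np.argmax([log_lik(t) for t in thetas])]
best_by_min_ce = thetas[np.argmin([cross_ent(t) for t in thetas])]

# Both pick the same grid point (the one nearest the sample mean of y)
print(best_by_max_loglik, best_by_min_ce)
print(best_by_max_loglik == best_by_min_ce)  # True
```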


Additional examples and a more general result can be found in this related thread: How to construct a cross-entropy loss for general regression targets?
