Cross-Entropy vs Log Likelihood – Use in Output Layers

maximum-likelihood, neural-networks, softmax

I read this page:
http://neuralnetworksanddeeplearning.com/chap3.html

and it says that a sigmoid output layer with cross-entropy is quite similar to a softmax output layer with log-likelihood.

What happens if I use sigmoid with log-likelihood, or softmax with cross-entropy, in the output layer? Is it fine? Because I see there's only a small difference between the equations for cross-entropy (eq.57):

$$C = -\frac{1}{n} \sum\limits_x (y \ln a + (1-y) \ln (1-a))$$

and log likelihood (eq.80):

$$C =-\frac{1}{n} \sum\limits_x(\ln a^L_y)$$

Best Answer

The negative log likelihood (eq.80) is also known as the multiclass cross-entropy (ref: Pattern Recognition and Machine Learning Section 4.3.4), as they are in fact two different interpretations of the same formula.

eq.57 is the negative log likelihood of the Bernoulli distribution, whereas eq.80 is the negative log likelihood of the multinomial distribution with one observation (a multiclass version of Bernoulli).

For binary classification problems, the softmax function outputs two values (each between 0 and 1, summing to 1) to give the predicted probability of each class, while the sigmoid function outputs a single value $p$ (between 0 and 1) to give the predicted probability of one class (so the probability of the other class is $1-p$).
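A quick numpy check of this relationship (a minimal sketch; the logits `z1` and `z2` are arbitrary illustrative values): a two-way softmax is just a sigmoid applied to the difference of the two logits, so the two output conventions carry the same information.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z1, z2 = 1.3, -0.4                        # arbitrary logits for the two classes
p_softmax = softmax(np.array([z1, z2]))   # two values summing to 1
p_sigmoid = sigmoid(z1 - z2)              # one value: P(class 1)

print(p_softmax[0])   # ~0.8455
print(p_sigmoid)      # same number: softmax over 2 classes == sigmoid of the logit gap
```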

So eq.80 can't be directly applied to the sigmoid output, though it is essentially the same loss as eq.57.
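To make "essentially the same loss" concrete, here is a minimal numpy sketch (the label `y` and probability `p` are made-up illustrative numbers): expanding the sigmoid output into a two-class distribution and then applying eq.80 reproduces eq.57 exactly for a single example.

```python
import numpy as np

y = 1        # true class (binary label)
p = 0.73     # sigmoid output: predicted P(y = 1)

# eq.57: Bernoulli negative log likelihood for one example
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# eq.80 needs a full distribution over classes, so first expand the
# single sigmoid output into two probabilities [P(y=0), P(y=1)]
probs = np.array([1 - p, p])
nll = -np.log(probs[y])   # pick out the true class, as in eq.80

print(bce, nll)           # identical: 0.3147...
```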

Also see this answer.


Following is a simple illustration of the connection between (sigmoid + binary cross-entropy) and (softmax + multiclass cross-entropy) for binary classification problems.

Say we take $0.5$ as the decision threshold between the two categories. For the sigmoid output it follows that

$$\sigma(wx+b)=0.5$$ $$wx+b=0$$ which is the decision boundary in the feature space.

For softmax output it follows $$\frac{e^{w_1x+b_1}}{e^{w_1x+b_1}+e^{w_2x+b_2}}=0.5$$ $$e^{w_1x+b_1}=e^{w_2x+b_2}$$ $$w_1x+b_1=w_2x+b_2$$ $$(w_1-w_2)x+(b_1-b_2)=0$$ so it remains the same model although there are twice as many parameters.
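A small numpy sketch of this reparameterization (the weights here are random values chosen only for illustration): collapsing the two softmax weight vectors into their difference $w_1 - w_2$ (and likewise the biases) gives a sigmoid model that assigns identical probabilities, so the decision boundary is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=2), rng.normal()   # softmax parameters, class 1
w2, b2 = rng.normal(size=2), rng.normal()   # softmax parameters, class 2

x = rng.normal(size=2)                      # an arbitrary 2-D input

def softmax_p1(x):
    """P(class 1) from the two-output softmax model."""
    z1, z2 = w1 @ x + b1, w2 @ x + b2
    return np.exp(z1) / (np.exp(z1) + np.exp(z2))

def sigmoid_p1(x):
    """P(class 1) from the equivalent sigmoid model: w = w1 - w2, b = b1 - b2."""
    return 1.0 / (1.0 + np.exp(-((w1 - w2) @ x + (b1 - b2))))

print(softmax_p1(x), sigmoid_p1(x))         # same probability for any x
```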

The following plots show the decision boundaries obtained using these two methods; they are almost identical.