Solved – How does Cross-Entropy (log loss) work with backpropagation

backpropagation, cross-entropy, log-loss, regression

I am having some trouble understanding how Cross Entropy would work with backpropagation.
For backpropagation we exploit the chain rule to find the partial derivative of the Error function in terms of the weight. This would mean that we need the derivative of the Cross Entropy function just as we would do it with the Mean Squared Error.
If I differentiate log loss, I get a function that is undefined for some values. I assume that is not acceptable, as it would try to update the weight with NaN.

Cross-Entropy can be written as the following for one instance:

$$L(x, y) = -\big(y\log(x) + (1-y)\log(1-x)\big)$$

(Here x denotes the value predicted by the network, while y is the label.)

When we differentiated the Mean Squared Error, we simply got back

$$x - y$$

It is well-defined for all values. Now, if we were to differentiate the cross-entropy above, it would yield the following equation:

$$\frac{\partial L}{\partial x} = -\frac{y}{x} + \frac{1-y}{1-x}$$

We can easily divide by zero: if the prediction x is exactly 0 or 1, which happens when the network is exactly right (or exactly wrong) in binary regression, one of the terms has a zero denominator.
That is why I assume this is not exactly how we deal with cross-entropy. Or am I wrong?
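
To make the problem concrete, here is a minimal NumPy sketch of that gradient (the helper name bce_grad is only for illustration, not from any library):

```python
import numpy as np

def bce_grad(x, y):
    """Gradient of -(y*log(x) + (1-y)*log(1-x)) with respect to the prediction x."""
    return -y / x + (1 - y) / (1 - x)

# Well-behaved while the prediction stays strictly inside (0, 1):
print(bce_grad(0.8, 1.0))    # -1.25
print(bce_grad(0.999, 1.0))  # ~ -1.001, the limit as x -> 1 with y = 1 is finite
# But at exactly x == y == 1 the second term is 0/0:
print(bce_grad(np.float64(1.0), np.float64(1.0)))  # nan, with a runtime warning
# And a confidently wrong prediction divides a nonzero value by ~0:
print(bce_grad(1e-300, 1.0))  # -1e+300
```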
It might be that I am getting confused, but on a lot of sites this is combined with the softmax function.

Is the process a little different than it was with MSE? Or do we first find the derivative of the cross-entropy function, just as we would with MSE, and then everything else is the same?

Best Answer

For any finite input, softmax outputs are strictly between 0 and 1. This implies that you'll never divide by zero. This is one reason that you've observed softmax and cross-entropy are commonly used together.
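
To see why the combination behaves well, here is a minimal NumPy sketch (the function names are illustrative): when cross-entropy is applied to a softmax output and differentiated with respect to the logits, the softmax and the log cancel, and the gradient simplifies to softmax(z) minus the one-hot label, so no network output ever appears in a denominator.

```python
import numpy as np

def softmax(z):
    # Shift by the max for stability; the outputs are strictly inside (0, 1).
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_ce_grad(z, y_index):
    """Gradient of -log(softmax(z)[y_index]) with respect to the logits z.

    It reduces to softmax(z) - onehot(y), which is finite for any finite z.
    """
    grad = softmax(z)
    grad[y_index] -= 1.0
    return grad

z = np.array([8.0, -3.0, 0.5])
print(softmax_ce_grad(z, y_index=0))  # small, well-defined values; no division by an output
```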

Finite-precision arithmetic can result in numerical underflow, especially when a probability is exponentiated and then logged again. This can be avoided by working on the logit scale directly; for example: http://pytorch.org/docs/stable/nn.html#crossentropyloss
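
As a concrete illustration of working on the logit scale, here is a minimal NumPy sketch using the log-sum-exp trick (the function name is illustrative; the linked CrossEntropyLoss takes raw logits for the same reason):

```python
import numpy as np

def cross_entropy_from_logits(z, y_index):
    """-log(softmax(z)[y_index]) computed directly from the logits z.

    Computing log(sum(exp(z))) - z[y] after shifting by max(z) avoids overflow in exp
    and never takes the log of a probability that has already underflowed to 0.
    """
    z = z - np.max(z)                     # shift: exp never overflows
    return np.log(np.sum(np.exp(z))) - z[y_index]

z = np.array([1000.0, 0.0, -1000.0])      # extreme logits
print(cross_entropy_from_logits(z, 0))    # ~0.0, no inf or nan
# A naive softmax-then-log would hit exp(1000) = inf and log(0) = -inf here.
```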

See also: Infinities with cross entropy in practice
