Solved – Cross Entropy Loss for One Hot Encoding

Tags: backpropagation, categorical-encoding, cross-entropy, loss-functions, neural-networks

The cross-entropy (CE) loss sums the loss over all output nodes:

$\sum_i \left[ -target_i \log(output_i) \right]$.

The derivative of the CE loss with respect to $output_i$ is $-\frac{target_i}{output_i}$.
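For concreteness, here is a minimal numpy sketch (all values made up) that evaluates this loss and checks the elementwise derivative against a finite-difference approximation:

```python
import numpy as np

target = np.array([0.0, 1.0, 0.0])   # hypothetical one-hot target
output = np.array([0.2, 0.7, 0.1])   # hypothetical network output (probabilities)

def ce_loss(out):
    # sum_i [ -target_i * log(output_i) ]
    return -np.sum(target * np.log(out))

# Analytic elementwise derivative: -target_i / output_i
analytic = -target / output

# Finite-difference approximation of the same partial derivatives.
eps = 1e-6
numeric = np.array([
    (ce_loss(output + eps * np.eye(3)[i]) - ce_loss(output)) / eps
    for i in range(3)
])

print(analytic)  # zero wherever target_i = 0, about -1.43 at the labelled node
print(numeric)   # agrees with the analytic values up to finite-difference error
```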

Since for $target_i = 0$ both the loss and its derivative are zero regardless of the actual output, it seems like only the node with $target_i = 1$ receives feedback on how to adjust the weights.

I also noticed the singularity in the derivative at $output_i = 0$. How is this handled during backpropagation?

I do not see how the weights are adjusted so that the outputs with $target_i = 0$ actually move toward zero.

Best Answer

Cross-entropy with one-hot encoding implies that the target vector is all $0$, except for one $1$. So all of the zero entries are ignored and only the entry with $1$ is used for updates. You can see this directly from the loss, since $0 \times \log(\text{something positive})=0$, implying that only the predicted probability associated with the label influences the value of the loss.
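A tiny numerical illustration of this (the probabilities are made up): summing over every node gives exactly the same number as taking $-\log$ of the single predicted probability at the label.

```python
import numpy as np

target = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot label at index 2
output = np.array([0.1, 0.2, 0.6, 0.1])   # hypothetical predicted probabilities

full_sum   = -np.sum(target * np.log(output))   # loss summed over all nodes
label_only = -np.log(output[2])                 # only the entry where target = 1

print(full_sum, label_only)   # both ~0.5108; the target = 0 terms add nothing
```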

This works because the neural network prediction is a probability vector over mutually-exclusive outcomes, so by definition, the prediction vector must (1) have non-negative elements and (2) the elements must sum to 1. This means that making one part of the vector larger must shrink the sum of the remaining components by the same amount.
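A short sketch of that coupling (the logits are arbitrary): increasing one logit increases its softmax probability and necessarily decreases all the others, while the vector keeps summing to 1.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 2.0, 0.5])     # arbitrary pre-activations
bumped = logits.copy()
bumped[0] += 1.0                       # increase only the first logit

p, q = softmax(logits), softmax(bumped)
print(p.sum(), q.sum())   # both 1.0
print(p)                  # e.g. [0.2312, 0.6285, 0.1402]
print(q)                  # first entry grows, the other two shrink
```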

Usually for the case of one-hot labels, one uses the softmax activation function. Mathematically, softmax has asymptotes at 0 and 1, so singularities do not occur. As a matter of floating point arithmetic, however, rounding and underflow can occasionally produce a predicted probability of exactly $0$ or $1$, which leads to $\log(0)$. Usually these problems are avoided by rearranging the equations and working on a different scale, such as the logits (e.g. https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits).
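A minimal sketch of that rearrangement, assuming the usual log-sum-exp trick (the helper `ce_from_logits` is hypothetical, not TensorFlow's implementation): computing $\log(\text{softmax})$ directly from the logits avoids ever forming a probability that could round to $0$.

```python
import numpy as np

def ce_from_logits(logits, target):
    # log(softmax(z)_i) = z_i - logsumexp(z), evaluated without ever
    # materialising the probabilities, so log(0) cannot occur.
    z = logits - np.max(logits)                 # shift for stability
    log_probs = z - np.log(np.sum(np.exp(z)))
    return -np.sum(target * log_probs)

target = np.array([0.0, 1.0, 0.0])
logits = np.array([500.0, -500.0, 10.0])        # extreme, but finite, logits

# Naive route: the labelled probability underflows to exactly 0, so the loss
# blows up (numpy also emits a divide-by-zero warning here).
probs = np.exp(logits) / np.sum(np.exp(logits))
print(-np.sum(target * np.log(probs)))          # inf

# Working on the logit scale keeps the result finite.
print(ce_from_logits(logits, target))           # ~1000.0
```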

A related question with more detailed calculus can be found in Backpropagation with Softmax / Cross Entropy.