Solved – Cross Entropy with Log Softmax Activation

cross-entropy, loss-functions, machine-learning

My question is about how log softmax is implemented in practice with the cross-entropy loss.

Softmax gives values between 0 and 1, which means log softmax will give values between -infinity and 0. This means that we cannot use one-hot encoding (one 1 and the rest 0s) for our target labels anymore (correct me if I am wrong). Our labels should now contain one 0 (which is the target) and the rest as -infinity.

The cross-entropy loss function is given as:

$ -\sum_i t_i \log(o_i) $

where $ t_i $ is the target label for the $ i^{th} $ class,

and $ o_i $ is the predicted output (probability) for the $ i^{th} $ class.

Now, when we use our target label (which is in the range -infinity to 0), the loss will become +infinity because of the -infinity terms in the target vector, and thus it becomes numerically unstable. What is the way around this?

Another question: how does PyTorch handle this? The nn.CrossEntropyLoss() function calculates log_softmax on the predicted outputs internally, but I cannot find anywhere in the documentation where it converts the one-hot target labels (range 0 to 1) that we pass to the loss function into labels with the range -infinity to 0. Am I wrong in assuming that the target label needs to be changed?

Any help will be appreciated. Thanks!

Best Answer

Mathematically, softmax with finite inputs produces results $o_i \in (0,1) \forall i$ such that $\sum_i o_i =1$. This implies that softmax is never 0, so $\log(o_i)$ is always a real number.

Numerically, overflow or underflow can cause softmax to output a zero; this is common enough when training neural networks with floating-point numbers. A common work-around is to work on the log scale via log_softmax, or else to stay on the logit scale: do not transform your outputs at all, and instead use a loss function defined directly on the logits. Both approaches avoid round-tripping (which loses precision) and use numerical tricks to keep values in a well-behaved floating-point range.
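As a minimal NumPy sketch of the problem, and of the log-sum-exp trick that a stable log_softmax uses, with logit values made up purely to force overflow:

```python
import numpy as np

logits = np.array([1000.0, 0.0, -1000.0])  # extreme values chosen to break the naive path

# Naive log(softmax(x)): exp(1000) overflows to inf, so normalising gives nan/0,
# and the final log produces nan / -inf (plus RuntimeWarnings).
naive = np.log(np.exp(logits) / np.sum(np.exp(logits)))

# Stable log_softmax: subtract the max logit before exponentiating (log-sum-exp trick),
# so every intermediate value stays in a representable range.
shifted = logits - logits.max()
log_softmax = shifted - np.log(np.sum(np.exp(shifted)))

print(naive)        # [  nan  -inf  -inf]
print(log_softmax)  # [    0. -1000. -2000.]
```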

Obviously, working on the log scale, or the logit scale, requires making algebraic adjustments so that the loss is also on the appropriate scale. So if you use identity activations in the final layer, you use CrossEntropyLoss. If you use log_softmax in the final layer, you use NLLLoss.
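For instance, a small PyTorch sketch of those two pairings (random logits and arbitrary class indices, purely for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 3)            # batch of 4 samples, 3 classes, raw network outputs
targets = torch.tensor([0, 2, 1, 2])  # integer class indices, not one-hot vectors

# Identity final layer -> CrossEntropyLoss on raw logits
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# log_softmax final layer -> NLLLoss on log-probabilities
loss_nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(loss_ce, loss_nll))  # True
```

Note that the targets in both calls are plain integer class indices; at no point are the labels themselves transformed.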

Consider the probability outputs $0 < o_i < 1$ from the network, produced by softmax with finite inputs. We wish to compute the cross-entropy loss.

  • One option is to do things the naïve way, using $o_i$ and $t_i$ directly, and computing $-\sum_i t_i \log(o_i)$.
  • A second option is to use log-probabilities instead. This means you have $z_i = \log(o_i)$ in hand, so you compute $-\sum_i t_i \log(o_i) = -\sum_i t_i z_i$ (see the sketch after this list).
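A minimal NumPy sketch of those two options, with a made-up softmax output and a one-hot target (and assuming $o_i$ itself has not underflowed), shows they give the same number:

```python
import numpy as np

o = np.array([0.7, 0.2, 0.1])   # softmax output for one sample
t = np.array([1.0, 0.0, 0.0])   # one-hot target

loss_from_probs = -np.sum(t * np.log(o))   # option 1: work with probabilities o_i
z = np.log(o)                              # log-probabilities z_i = log(o_i)
loss_from_logprobs = -np.sum(t * z)        # option 2: work with log-probabilities

print(loss_from_probs, loss_from_logprobs)  # both ~0.3567, i.e. -log(0.7)
```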

I can't answer the part of your question about re-labeling because its premise is mistaken: the target labels never need to be changed. When you're using a numerically stable procedure, $\log(o_i)$ is always a finite number, so $t_i \log(o_i)$ for $t_i \in \{0,1\}$ is also finite. In fact, in the case of 1-hot labels, only one index $i$ has a non-zero value of $t_i \log(o_i)$.
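To illustrate that last point with a small PyTorch sketch (made-up logits): with a one-hot target, the sum collapses to minus the log-probability of the single true class, which is what nll_loss computes from an integer class index, so no re-labeling to $-\infty$ is ever involved.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])
target = torch.tensor([0])                  # integer class index, not one-hot

log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[0, target.item()]  # the only non-zero term of the sum
loss_nll = F.nll_loss(log_probs, target)

print(loss_manual.item(), loss_nll.item())  # identical
```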

See also: Infinities with cross entropy in practice