Solved – Difference between the mathematical definition and the TensorFlow implementation of softmax cross entropy with logits

cross-entropy, logit, loss-functions, softmax, tensorflow

Softmax cross entropy with logits is defined as follows:

$a_i = \frac{e^{z_i}}{\sum_{\forall j} e^{z_j}}$

$l = \sum_{\forall i} y_i \log(a_i)$

Where $l$ is the actual loss.
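
For concreteness, here is a minimal NumPy sketch of this definition (the function names are mine, and the leading minus sign follows the usual cross-entropy convention that the answer's closing note points out):

```python
import numpy as np

def softmax(z):
    # a_i = exp(z_i) / sum_j exp(z_j)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, a):
    # l = -sum_i y_i * log(a_i)   (minus sign per the usual convention)
    return -(y * np.log(a)).sum()

z = np.array([2.0, 1.0, 0.1])   # example logits
y = np.array([1.0, 0.0, 0.0])   # one-hot label
print(cross_entropy(y, softmax(z)))
```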

But when you look into the C++ TensorFlow implementation of the SoftmaxCrossEntropyWithLogits operation, the exact formula they use is described as:

$l = \sum_{\forall j} y_j \left( (z_j - \max(z)) - \log\left(\sum_{\forall i} e^{z_i - \max(z)}\right) \right)$

The part $z - \max(z)$ is perfectly understood: it is just a normalization that helps avoid under/overflow.
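
Transcribed literally, the quoted formula looks like the following sketch (this is only an illustration of the expression above, not TensorFlow's actual C++ code; without a leading minus it evaluates to the negative of the textbook loss, which ties into the sign note at the end of the answer):

```python
import numpy as np

def xent_with_logits_formula(y, z):
    # sum_j y_j * ((z_j - max(z)) - log(sum_i exp(z_i - max(z))))
    shifted = z - z.max()
    return (y * (shifted - np.log(np.exp(shifted).sum()))).sum()

z = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])
print(xent_with_logits_formula(y, z))  # equals -cross_entropy(y, softmax(z)) from above
```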

BUT:

  • Where is the actual Softmax in their implementation?

  • Why do they subtract $\log\left(\sum_{\forall i} e^{z_i - \max(z)}\right)$ from each $z_j$ before multiplying by $y_j$?

Note: One may argue that the code I indicated is just TensorFlow's implementation of CrossEntropyWithLogits and not SoftmaxCrossEntropyWithLogits.

But the actual SoftmaxCrossEntropyWithLogits operation only additionally checks dimensions and does not perform any further computation.

Best Answer

Since Eigen's implementation uses the natural logarithm (base $e$), we know that $\log(e^{z_i}) = z_i$ and $\log\left(\frac{x}{y}\right) = \log x - \log y$. Note also that the softmax is unchanged when the same constant $\max(z)$ is subtracted from every logit, so $a_i = \frac{e^{z_i - \max(z)}}{\sum_{\forall j} e^{z_j - \max(z)}}$. We therefore have: $$ \begin{align} \log(a_i) &= \log\frac{e^{z_i - \max(z)}}{\sum_{\forall j} e^{z_j - \max(z)}} \\ &= \log\left(e^{z_i - \max(z)}\right) - \log\left(\sum_{\forall j} e^{z_j - \max(z)}\right) \\ &= (z_i - \max(z)) - \log\left(\sum_{\forall j} e^{z_j - \max(z)}\right) \end{align} $$

So $$ \begin{align} l &= -\sum_{\forall i} y_i \log(a_i) \\ &= -\sum_{\forall i} y_i \log\frac{e^{z_i - \max(z)}}{\sum_{\forall j} e^{z_j - \max(z)}} \\ &= -\sum_{\forall i} y_i \left( (z_i - \max(z)) - \log\left(\sum_{\forall j} e^{z_j - \max(z)}\right) \right) \end{align} $$
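
A quick numerical sanity check of this identity (a NumPy sketch with an arbitrary logit vector):

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])

# Left-hand side: log of the softmax computed directly.
log_a = np.log(np.exp(z) / np.exp(z).sum())

# Right-hand side: the transformed, numerically stable expression.
shifted = z - z.max()
log_a_stable = shifted - np.log(np.exp(shifted).sum())

print(np.allclose(log_a, log_a_stable))  # True
```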

To answer your questions:

  • Where is the actual Softmax in their implementation?

Looking at my explanation above, you can see that the original softmax formula (more precisely, its logarithm) has been transformed into $(z_i - \max(z)) - \log\left(\sum_{\forall j} e^{z_j - \max(z)}\right)$, and that is exactly what TensorFlow's implementation computes. They are the same.

  • Why do they subtract $\log\left(\sum_{\forall i} e^{z_i - \max(z)}\right)$ from each $z_j$ before multiplying by $y_j$?

This is simply the result of the algebraic transformation shown above.
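
As an end-to-end check, the sketch below compares the textbook softmax-plus-cross-entropy, the transformed expression, and TensorFlow's fused op (assuming TensorFlow 2.x with eager execution; the array values are arbitrary):

```python
import numpy as np
import tensorflow as tf  # assumes TensorFlow 2.x

z = np.array([[2.0, 1.0, 0.1]], dtype=np.float32)  # logits, one example
y = np.array([[1.0, 0.0, 0.0]], dtype=np.float32)  # one-hot labels

# Textbook form: softmax, then cross-entropy (with the minus sign).
a = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
loss_naive = -(y * np.log(a)).sum(axis=-1)

# Transformed form from the derivation above.
shifted = z - z.max(axis=-1, keepdims=True)
log_sum = np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
loss_stable = -(y * (shifted - log_sum)).sum(axis=-1)

# TensorFlow's fused op.
loss_tf = tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=z).numpy()

print(loss_naive, loss_stable, loss_tf)  # all three agree
```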

Note:

From Wikipedia, the exact formula for the cross entropy should have a minus sign before the sum: $$ l = -\sum_x{p(x)\log q(x)} $$