Softmax cross entropy with logits is defined as follows:
$a_i = \frac{e^{z_i}}{\sum_{\forall j} e^{z_j}}$
$l={\sum_{\forall i}}y_i\log(a_i)$
Where $l$ is the actual loss.
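A minimal NumPy sketch of these two formulas (the function and variable names are my own, and the minus sign follows the conventional definition of cross entropy noted at the end of the answer below):

```python
import numpy as np

def softmax(z):
    # a_i = e^{z_i} / sum_j e^{z_j}  (naive, no stabilization)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, a):
    # l = -sum_i y_i * log(a_i)  (with the conventional minus sign)
    return -np.sum(y * np.log(a))

z = np.array([2.0, 1.0, 0.1])   # logits
y = np.array([1.0, 0.0, 0.0])   # one-hot label
print(cross_entropy(y, softmax(z)))
```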
But when you look deep into the C++ TensorFlow implementation of the SoftmaxCrossEntropyWithLogits
operation, the exact formula they use is described as:
$l={\sum_{\forall j}}y_j ((z_j-\max(z))-\log({\sum_{\forall i}}e^{z_i-\max(z)}))$
The part $z-\max(z)$ is perfectly understandable – it is just a normalization which helps to avoid under/overflow.
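To see why the normalization matters, here is a small NumPy demonstration (my own example, not from the TensorFlow source): with large logits the naive softmax overflows to `inf`/`nan`, while subtracting $\max(z)$ keeps every exponent at or below zero.

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax: e^{1000} overflows float64 to inf, so inf/inf gives nan.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(z) / np.exp(z).sum()

# Stabilized softmax: subtracting max(z) makes every exponent <= 0.
e = np.exp(z - z.max())
stable = e / e.sum()

print(naive)   # all nan
print(stable)  # well-defined probabilities summing to 1
```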
BUT:

- Where is the actual Softmax in their implementation?
- Why do they subtract $\log({\sum_{\forall i}}e^{z_i-\max(z)})$ from each $z_j$ before multiplying it by $y_j$?
Note: One may argue that the code I indicated is just TensorFlow's implementation of CrossEntropyWithLogits
and not SoftmaxCrossEntropyWithLogits.
But the actual SoftmaxCrossEntropyWithLogits
operation additionally checks only dimensions and does not perform any further computation.
Best Answer
Since Eigen's implementation of the $\log$ function uses base $e$, we know that $\log(e^{z_i}) = z_i$ and $\log({x \over y}) = \log x - \log y$. Also, multiplying the numerator and denominator of the softmax by $e^{-\max(z)}$ leaves it unchanged, so $a_i = {e^{z_i - \max(z)} \over \sum_{\forall j}{e^{z_j - \max(z)}}}$. We have: $$ \begin{align} \log(a_i) &= \log{e^{z_i - \max(z)}\over \sum_{\forall j}{e^{z_j - \max(z)} }} \\ &= \log(e^{z_i - \max(z)}) - \log\Big({\sum_{\forall j}{e^{z_j - \max(z)}}}\Big) \\ &= (z_i - \max(z)) - \log\Big({\sum_{\forall j}{e^{z_j - \max(z)}}}\Big) \end{align} $$So $$ \begin{align} l & = -\sum_{\forall i} {y_i\log(a_i)} \\ & = -\sum_{\forall i} {y_i} \log{e^{z_i - \max(z)}\over \sum_{\forall j}{e^{z_j - \max(z)} }}\\ & = -\sum_{\forall i} {y_i} \Big((z_i - \max(z)) - \log\Big({\sum_{\forall j}{e^{z_j - \max(z)}}}\Big)\Big) \end{align} $$
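A quick numeric check of this derivation (my own sketch, not the TensorFlow source): the naive "softmax then cross entropy" path and the fused, stabilized expression agree up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)       # logits
y = np.zeros(5)
y[2] = 1.0                   # one-hot label

# Naive path: compute the softmax, then the cross entropy.
a = np.exp(z) / np.exp(z).sum()
naive = -np.sum(y * np.log(a))

# Fused, stabilized expression from the derivation above.
m = z.max()
fused = -np.sum(y * ((z - m) - np.log(np.sum(np.exp(z - m)))))

print(abs(naive - fused))  # agrees up to float rounding
```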
To answer your questions:
Looking at my explanation above, you can see that the logarithm of the Softmax has been transformed into $(z_i - \max(z)) - \log({\sum_{\forall j}{e^{z_j - \max(z)}}})$, and that is exactly what TensorFlow's implementation computes. They are the same.
This is the result of the same transformation: the subtracted term is just the logarithm of the Softmax denominator.
Note:
From Wikipedia, the exact formula for the cross entropy should have a minus sign before the sum: $$ l = -\sum_x{p(x)\log q(x)} $$