Cross entropy is defined on probability distributions, not single values. The reason it works for classification is that classifier output is (often) a probability distribution over class labels. For example, the outputs of logistic/softmax functions are interpreted as probabilities. The observed class label is also treated as a probability distribution: the empirical distribution (where the probability is 1 for the observed class and 0 for the others).
The concept of cross entropy applies equally well to continuous distributions. But, it can't be used for regression models that output a point estimate (e.g. the conditional mean) but not a full probability distribution. If you had a model that gave the full conditional distribution (probability of output given input), you could use cross entropy as a loss function.
For continuous distributions $p$ and $q$, the cross entropy is defined as:
$$H(p, q) = -\int_{Y} p(y) \log q(y) dy$$
Just considering a single observed input/output pair $(x, y)$, $p$ would be the empirical conditional distribution (a delta function over the observed output value), and $q$ would be the modeled conditional distribution (probability of output given input). In this case, the cross entropy reduces to $-\log q(y \mid x)$. Summing over data points, this is just the negative log likelihood!
Note: I am not an expert on backprop, but now having read a bit, I think the following caveat is appropriate. When reading papers or books on neural nets, it is not uncommon for derivatives to be written using a mix of the standard summation/index notation, matrix notation, and multi-index notation (include a hybrid of the last two for tensor-tensor derivatives). Typically the intent is that this should be "understood from context", so you have to be careful!
I noticed a couple of inconsistencies in your derivation. I do not do neural networks really, so the following may be incorrect. However, here is how I would go about the problem.
First, you need to take account of the summation in $E$, and you cannot assume each term only depends on one weight. So taking the gradient of $E$ with respect to component $k$ of $z$, we have
$$E=-\sum_jt_j\log o_j\implies\frac{\partial E}{\partial z_k}=-\sum_jt_j\frac{\partial \log o_j}{\partial z_k}$$
Then, expressing $o_j$ as
$$o_j=\tfrac{1}{\Omega}e^{z_j} \,,\, \Omega=\sum_ie^{z_i} \implies \log o_j=z_j-\log\Omega$$
we have
$$\frac{\partial \log o_j}{\partial z_k}=\delta_{jk}-\frac{1}{\Omega}\frac{\partial\Omega}{\partial z_k}$$
where $\delta_{jk}$ is the Kronecker delta. Then the gradient of the softmax-denominator is
$$\frac{\partial\Omega}{\partial z_k}=\sum_ie^{z_i}\delta_{ik}=e^{z_k}$$
which gives
$$\frac{\partial \log o_j}{\partial z_k}=\delta_{jk}-o_k$$
or, expanding the log
$$\frac{\partial o_j}{\partial z_k}=o_j(\delta_{jk}-o_k)$$
Note that the derivative is with respect to $z_k$, an arbitrary component of $z$, which gives the $\delta_{jk}$ term ($=1$ only when $k=j$).
So the gradient of $E$ with respect to $z$ is then
$$\frac{\partial E}{\partial z_k}=\sum_jt_j(o_k-\delta_{jk})=o_k\left(\sum_jt_j\right)-t_k \implies \frac{\partial E}{\partial z_k}=o_k\tau-t_k$$
where $\tau=\sum_jt_j$ is constant (for a given $t$ vector).
This shows a first difference from your result: the $t_k$ no longer multiplies $o_k$. Note that for the typical case where $t$ is "one-hot" we have $\tau=1$ (as noted in your first link).
A second inconsistency, if I understand correctly, is that the "$o$" that is input to $z$ seems unlikely to be the "$o$" that is output from the softmax. I would think that it makes more sense that this is actually "further back" in network architecture?
Calling this vector $y$, we then have
$$z_k=\sum_iw_{ik}y_i+b_k \implies \frac{\partial z_k}{\partial w_{pq}}=\sum_iy_i\frac{\partial w_{ik}}{\partial w_{pq}}=\sum_iy_i\delta_{ip}\delta_{kq}=\delta_{kq}y_p$$
Finally, to get the gradient of $E$ with respect to the weight-matrix $w$, we use the chain rule
$$\frac{\partial E}{\partial w_{pq}}=\sum_k\frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}=\sum_k(o_k\tau-t_k)\delta_{kq}y_p=y_p(o_q\tau-t_q)$$
giving the final expression (assuming a one-hot $t$, i.e. $\tau=1$)
$$\frac{\partial E}{\partial w_{ij}}=y_i(o_j-t_j)$$
where $y$ is the input on the lowest level (of your example).
So this shows a second difference from your result: the "$o_i$" should presumably be from the level below $z$, which I call $y$, rather than the level above $z$ (which is $o$).
Hopefully this helps. Does this result seem more consistent?
Update: In response to a query from the OP in the comments, here is an expansion of the first step.
First, note that the vector chain rule requires summations (see here). Second, to be certain of getting all gradient components, you should always introduce a new subscript letter for the component in the denominator of the partial derivative. So to fully write out the gradient with the full chain rule, we have
$$\frac{\partial E}{\partial w_{pq}}=\sum_i \frac{\partial E}{\partial o_i}\frac{\partial o_i}{\partial w_{pq}}$$
and
$$\frac{\partial o_i}{\partial w_{pq}}=\sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}$$
so
$$\frac{\partial E}{\partial w_{pq}}=\sum_i \left[ \frac{\partial E}{\partial o_i}\left(\sum_k \frac{\partial o_i}{\partial z_k}\frac{\partial z_k}{\partial w_{pq}}\right) \right]$$
In practice the full summations reduce, because you get a lot of $\delta_{ab}$ terms. Although it involves a lot of perhaps "extra" summations and subscripts, using the full chain rule will ensure you always get the correct result.
Best Answer
Mathematically, softmax with finite inputs produces results $o_i \in (0,1) \forall i$ such that $\sum_i o_i =1$. This implies that softmax is never 0, so $\log(o_i)$ is always a real number.
Numerically, overflow or underflow could cause softmax to output a zero. This is common enough when training neural networks using floating point numbers. A common work-around to avoid numerical underflow (or overflow) is to work on the log scale via
log_softmax
, or else work on the logit scale and do not transform your outputs, but instead have a loss function defined on the logit scale. These methods avoid round-tripping (which causes a loss of precision) and use numerical tricks to keep values in nice floating point ranges.Obviously, working on the log scale, or the logit scale, requires making algebraic adjustments so that the loss is also on the appropriate scale. So if you use identity activations in the final layer, you use
CrossEntropyLoss
. If you uselog_softmax
in the final layer, you useNLLLoss
.Consider $0 < o_i < 1$ the probability output from the network, produced by softmax with finite input. We wish to compute the cross-entropy loss.
I can't answer the part of your question about re-labeling because it doesn't make sense. When you're using a numerically stable procedure, $\log(o_i)$ is always a finite number, so $t_i \log(o_i)$ for $y \in \{0,1\}$ is also finite. In fact, in the case of 1-hot labels, only one index $i$ has a non-zero value of $t_i \log (o_i)$.
See also: Infinities with cross entropy in practice