The internet has told me that when using Softmax combined with cross entropy, Step 1 simply becomes $\frac{\partial E} {\partial z_j} = o_j - t_j$ where $t$ is a one-hot encoded target output vector. Is this correct?
Yes. Before going through the proof, let me change the notation to avoid careless mistakes in translation:
Notation:
I'll follow the notation of a made-up example of color classification with $K$ output categories, whereby $j$ is the index denoting any of the $K$ output neurons, not necessarily the one corresponding to the true value ($t$). Now,
$$\begin{align} o_j&=\sigma(j)=\sigma(z_j)=\text{softmax}(j)=\text{softmax (neuron }j)=\frac{e^{z_j}}{\displaystyle\sum_{k=1}^{K} e^{z_k}}\\[3ex]
z_j &= \mathbf w_j^\top \mathbf x = \text{preactivation (neuron }j)
\end{align}$$
The loss function is the negative log likelihood:
$$E = -\log \sigma(t) = -\log \left(\text{softmax}(t)\right)$$
The negative log likelihood is also known as the multiclass cross-entropy (ref: Pattern Recognition and Machine Learning Section 4.3.4), as they are in fact two different interpretations of the same formula.
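For concreteness, here is a minimal Octave sketch of these two definitions (the numbers are made up, $K=4$; the max-subtraction is only for numerical stability and does not change $\sigma$):

z = [1.0; 2.0; 0.5; -1.0];                   % pre-activations z_j = w_j' * x
o = exp(z - max(z)) / sum(exp(z - max(z)));  % softmax: o_j = e^{z_j} / sum_k e^{z_k}
t = 2;                                       % index of the true category
E = -log(o(t))                               % negative log likelihood / cross-entropy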
Gradient of the loss function with respect to the pre-activation of an output neuron:
$$\begin{align}
\frac{\partial E}{\partial z_j}&=\frac{\partial}{\partial z_j}\left(-\log \sigma(t)\right)\\[2ex]
&=
\frac{-1}{\sigma(t)}\;\frac{\partial}{\partial z_j}\,\sigma(t)\\[2ex]
&=
\frac{-1}{\sigma(t)}\;\frac{\partial}{\partial z_j}\,\sigma(z_t)\\[2ex]
&=
\frac{-1}{\sigma(t)}\;\frac{\partial}{\partial z_j}\,\frac{e^{z_t}}{\displaystyle\sum_{k=1}^{K} e^{z_k}}\\[2ex]
&= \frac{-1}{\sigma(t)}\;\left[ \frac{\frac{\partial }{\partial z_j }e^{z_t}}{\displaystyle \sum_{k=1}^{K} e^{z_k}}
\;-\;
\frac{e^{z_t}\; \frac{\partial}{\partial z_j}\displaystyle \sum_{k=1}^{K} e^{z_k}}{\left[\displaystyle\sum_{k=1}^{K} e^{z_k}\right]^2}\right]\\[2ex]
&= \frac{-1}{\sigma(t)}\;\left[ \frac{\delta_{jt}\;e^{z_t}}{\displaystyle \sum_{k=1}^{K} e^{z_k}}
\;-\; \frac{e^{z_t}}{\displaystyle\sum_{k=1}^{K} e^{z_k}}\,
\frac{e^{z_j}}{\displaystyle\sum_{k=1}^{K} e^{z_k}}\right]\\[2ex]
&= \frac{-1}{\sigma(t)}\;\left(\delta_{jt}\,\sigma(t) - \sigma(t)\,\sigma(j) \right)\\[2ex]
&= - (\delta_{jt} - \sigma(j))\\[2ex]
&= \sigma(j) - \delta_{jt}
\end{align}$$
where the quotient rule was used in the fifth line and $\delta_{jt}$ is the Kronecker delta, equal to $1$ when $j = t$ and $0$ otherwise.
This is practically identical to $\frac{\partial E} {\partial z_j} = o_j - t_j$, and it becomes identical if, instead of focusing on $j$ as an individual output neuron, we move to vector notation (as indicated in your question): $\mathbf t$ becomes the one-hot encoded vector of true values, which in my notation would be $\small \begin{bmatrix}0&0&0&\cdots&1&\cdots&0&0\end{bmatrix}^\top$ with the single $1$ in position $t$ of the $K$ entries, and $t_j$ is its $j$-th component.
Then, with $\frac{\partial E} {\partial z_j} = o_j - t_j$ we are really calculating the gradient of the loss function with respect to the pre-activations of all output neurons at once: the vector $\mathbf t$ contains a $1$ only in the entry corresponding to the correct category, which plays exactly the role of the Kronecker delta $\delta_{jt}$, which is $1$ only when differentiating with respect to the pre-activation of the output neuron of the correct category.
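A quick way to convince yourself of $\frac{\partial E}{\partial z_j}=\sigma(j)-\delta_{jt}$ is a finite-difference check; this is a sketch I am adding (not course code), with made-up numbers:

K = 5; t = 3;                                        % 5 output neurons; class 3 is the true one
z = randn(K, 1);                                     % arbitrary pre-activations
sm = @(v) exp(v - max(v)) / sum(exp(v - max(v)));    % softmax
o = sm(z);
t_onehot = zeros(K, 1); t_onehot(t) = 1;             % one-hot encoded target
analytic = o - t_onehot;                             % sigma(j) - delta_{jt}
numeric = zeros(K, 1); h = 1e-6;
for j = 1:K                                          % central differences of E = -log(sigma(t))
  zp = z; zp(j) = zp(j) + h;
  zm = z; zm(j) = zm(j) - h;
  op = sm(zp); om = sm(zm);
  numeric(j) = (-log(op(t)) + log(om(t))) / (2 * h);
end
max(abs(analytic - numeric))                         % should be ~1e-9: the two gradients agree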
In Geoffrey Hinton's Coursera ML course, the following chunk of code illustrates the implementation in Octave:
%% Compute derivative of cross-entropy loss function.
error_deriv = output_layer_state - expanded_target_batch;
The expanded_target_batch corresponds to the one-hot encoded sparse matrix built from the targets of the training batch. Hence, for most of the output neurons, error_deriv = output_layer_state $(\sigma(j))$, because $\delta_{jt}$ is $0$; only for the neuron corresponding to the correct classification is a $1$ subtracted from $\sigma(j)$.
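For illustration, here is a toy version of those two objects (shapes and numbers invented by me, assuming one column per training case, which is the convention the course code uses for the one-hot sparse matrix):

targets = [2 4 1];                                            % true class per case: K = 4, batch of 3
expanded_target_batch = full(sparse(targets, 1:3, 1, 4, 3));  % one-hot columns
output_layer_state = [0.10 0.20 0.70;                         % softmax outputs sigma(j), columns sum to 1
                      0.60 0.10 0.10;
                      0.20 0.30 0.10;
                      0.10 0.40 0.10];
error_deriv = output_layer_state - expanded_target_batch      % = sigma(j) - delta_{jt}, per case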
The actual measurement of the cost is carried out with...
% MEASURE LOSS FUNCTION.
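% (tiny is a small positive constant that keeps the argument of log away from an exact 0.)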
CE = -sum(sum(...
expanded_target_batch .* log(output_layer_state + tiny))) / batchsize;
We see again the $\frac{\partial E}{\partial z_j}$ in the beginning of the backpropagation algorithm:
$$\small\frac{\partial E}{\partial W_{\text{hidd-2-out}}}=\frac{\partial\, \text{outer}_{\text{input}}}{\partial W_{\text{hidd-2-out}}}\; \frac{\partial E}{\partial\, \text{outer}_{\text{input}}}=\frac{\partial z_j}{\partial W_{\text{hidd-2-out}}}\; \frac{\partial E}{\partial z_j}$$
in
hid_to_output_weights_gradient = hidden_layer_state * error_deriv';
output_bias_gradient = sum(error_deriv, 2);
since $z_j = \text{outer}_{\text{input}} = W_{\text{hidd-2-out}} \times \text{hidden}_{\text{output}}$
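To see how the shapes line up, here is a made-up example (dimensions invented for illustration, again with one column per training case):

hidden_layer_state = randn(6, 3);   % hidden activations: 6 hidden units, batch of 3 cases
error_deriv = randn(4, 3);          % dE/dz_j for 4 output neurons over the same 3 cases
hid_to_output_weights_gradient = hidden_layer_state * error_deriv';  % 6 x 4: sums (dz_j/dW)(dE/dz_j) over the batch
output_bias_gradient = sum(error_deriv, 2);                          % 4 x 1: the biases see dz_j/db = 1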
Observation re: the OP's additional questions:
The splitting of partials in the OP, $\frac{\partial E} {\partial z_j} = {\frac{\partial E} {\partial o_j}}{\frac{\partial o_j} {\partial z_j}}$, seems unwarranted.
The updating of the weights from hidden to output proceeds as...
hid_to_output_weights_delta = ...
momentum .* hid_to_output_weights_delta + ...
hid_to_output_weights_gradient ./ batchsize;
hid_to_output_weights = hid_to_output_weights...
- learning_rate * hid_to_output_weights_delta;
which do not include the output $o_i$ that appears in the OP's formula: $w_{ij} = w'_{ij} - r\,\frac{\partial E} {\partial z_j}\, o_i.$
The update would be more along the lines of...
$$\small \Delta_{\text{hidd-2-out}} := \text{momentum}\cdot \Delta_{\text{hidd-2-out}} + \frac{1}{\text{batchsize}}\,\frac{\partial E}{\partial W_{\text{hidd-2-out}}}\,,\qquad
W_{\text{hidd-2-out}} := W_{\text{hidd-2-out}} - r\, \Delta_{\text{hidd-2-out}}$$
$$\nabla L = \begin{pmatrix}
\frac{\partial L}{\partial w_1} \\
\frac{\partial L}{\partial w_2} \\
\vdots \\
\frac{\partial L}{\partial w_n} \end{pmatrix}$$
This requires computing the derivatives of terms like
$$\log {1 \over {1+e^{-\vec x \cdot \vec w}}} = \log {1 \over {1+e^{-(x_1 \cdot w_1 + x_2 \cdot w_2 + \, \dots \, + x_n \cdot w_n)}}}$$
where you can use
$$\frac{\partial}{\partial x} \left( \log\frac{1}{1+e^{-(a+bx)}} \right) = \frac{b}{1+e^{(a+bx)}}$$
and
$$\frac{\partial}{\partial x} \left( \log\left(1-\frac{1}{1+e^{-(a+bx)}}\right) \right) = \frac{-b}{1+e^{-(a+bx)}}$$
Filling that in you get
$$\frac{\partial}{\partial w_j} L = \frac{\bar{y}_i\, x_j}{{1+e^{\vec x \cdot \vec w}}} - \frac{(1-\bar{y}_i)\, x_j }{{1+e^{-\vec x \cdot \vec w}}}$$
and
$$\nabla L = \left( \frac{\bar{y}_i}{{1+e^{\vec x \cdot \vec w}}} - \frac{1-\bar{y}_i}{{1+e^{-\vec x \cdot \vec w}}} \right) \vec{x} $$
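A finite-difference sanity check of this gradient in Octave (the data here is made up for illustration; y plays the role of $\bar{y}_i$ for a single example):

x = [0.5; -1.2; 2.0]; w = [0.3; 0.1; -0.7]; y = 1;   % one example and one label
sig = @(u) 1 ./ (1 + exp(-u));
L = @(w) y * log(sig(x' * w)) + (1 - y) * log(1 - sig(x' * w));
analytic = (y / (1 + exp(x' * w)) - (1 - y) / (1 + exp(-x' * w))) * x;
numeric = zeros(3, 1); h = 1e-6;
for j = 1:3
  wp = w; wp(j) = wp(j) + h;
  wm = w; wm(j) = wm(j) - h;
  numeric(j) = (L(wp) - L(wm)) / (2 * h);
end
max(abs(analytic - numeric))                         % should be ~1e-9: the formula matches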
Best Answer
There is indeed a mistake:
$$\begin{align} \frac{\partial E_x}{\partial o_j^x} &=\frac{\partial }{\partial o_j^x} \left( - \sum_{k}\left[t_k^x \log(o_k^x) + (1-t_k^x) \log(1-o_k^x)\right]\right) \\ &=-\frac{\partial }{\partial o_j^x} \left( \sum_{k}\left[t_k^x \log(o_k^x) + (1-t_k^x) \log(1-o_k^x)\right]\right) \\ &=-\frac{\partial }{\partial o_j^x} \left( t_j^x \log(o_j^x) + (1-t_j^x) \log(1-o_j^x)\right) \\ &=- \left( \frac{t_j^x}{o_j^x} - \frac{1-t_j^x}{1-o_j^x}\right), \text{ chain rule} \\ &=- \frac{t_j^x}{o_j^x} + \frac{1-t_j^x}{1-o_j^x} \end{align}$$
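And a quick numerical confirmation of that derivative (a sketch with made-up values, not part of the original post):

t = [1 0 0 1]; o = [0.7 0.2 0.4 0.9];          % targets t_k^x and outputs o_k^x for one case x
E = @(o) -sum(t .* log(o) + (1 - t) .* log(1 - o));
analytic = -t ./ o + (1 - t) ./ (1 - o);
numeric = zeros(1, 4); h = 1e-6;
for j = 1:4
  op = o; op(j) = op(j) + h;
  om = o; om(j) = om(j) - h;
  numeric(j) = (E(op) - E(om)) / (2 * h);
end
max(abs(analytic - numeric))                   % should be ~1e-9: the derivative checks out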