$$f=\sum w_ix_i+b$$ $$\frac{df}{dw_i}=x_i$$ $$\frac{dL}{dw_i}=\frac{dL}{df}\frac{df}{dw_i}=\frac{dL}{df}x_i$$
because $x_i>0$, the gradient $\dfrac{dL}{dw_i}$ always has the same sign as $\dfrac{dL}{df}$ (all positive or all negative).
Update
Say there are two parameters $w_1$ and $w_2$. If the gradients of two dimensions are always of the same sign (i.e., either both are positive or both are negative), it means we can only move roughly in the direction of northeast or southwest in the parameter space.
If our goal happens to be in the northwest, we can only move in a zig-zagging fashion to get there, just like parallel parking in a narrow space. (forgive my drawing)
Therefore all-positive or all-negative activation functions (relu, sigmoid) can be difficult for gradient based optimization. To solve this problem we can normalize the data in advance to be zero-centered as in batch/layer normalization.
Also another solution I can think of is to add a bias term for each input so the layer becomes
$$f=\sum w_i(x_i+b_i).$$
The gradients is then
$$\frac{dL}{dw_i}=\frac{dL}{df}(x_i-b_i)$$
the sign won't solely depend on $x_i$.
The internet has told me that when using Softmax combined with cross entropy, Step 1 simply becomes $\frac{\partial E} {\partial z_j} = o_j - t_j$ where $t$ is a one-hot encoded target output vector. Is this correct?
Yes. Before going through the proof, let me change the notation to avoid careless mistakes in translation:
Notation:
I'll follow the notation in this made-up example of color classification:
whereby $j$ is the index denoting any of the $K$ output neurons - not necessarily the one corresponding to the true, ($t)$, value. Now,
$$\begin{align} o_j&=\sigma(j)=\sigma(z_j)=\text{softmax}(j)=\text{softmax (neuron }j)=\frac{e^{z_j}}{\displaystyle\sum_K e^{z_k}}\\[3ex]
z_j &= \mathbf w_j^\top \mathbf x = \text{preactivation (neuron }j)
\end{align}$$
The loss function is the negative log likelihood:
$$E = -\log \sigma(t) = -\log \left(\text{softmax}(t)\right)$$
The negative log likelihood is also known as the multiclass cross-entropy (ref: Pattern Recognition and Machine Learning Section 4.3.4), as they are in fact two different interpretations of the same formula.
Gradient of the loss function with respect to the pre-activation of an output neuron:
$$\begin{align}
\frac{\partial E}{\partial z_j}&=\frac{\partial}{\partial z_j}\,-\log\left( \sigma(t)\right)\\[2ex]
&=
\frac{-1}{\sigma(t)}\quad\frac{\partial}{\partial z_j}\sigma(t)\\[2ex]
&=
\frac{-1}{\sigma(t)}\quad\frac{\partial}{\partial z_j}\sigma(z_j)\\[2ex]
&=
\frac{-1}{\sigma(t)}\quad\frac{\partial}{\partial z_j}\frac{e^{z_t}}{\displaystyle\sum_k e^{z_k}}\\[2ex]
&= \frac{-1}{\sigma(t)}\quad\left[ \frac{\frac{\partial }{\partial z_j }e^{z_t}}{\displaystyle \sum_K e^{z_k}}
\quad - \quad
\frac{e^{z_t}\quad \frac{\partial}{\partial z_j}\displaystyle \sum_K e^{z_k}}{\left[\displaystyle\sum_K e^{z_k}\right]^2}\right]\\[2ex]
&= \frac{-1}{\sigma(t)}\quad\left[ \frac{\delta_{jt}\;e^{z_t}}{\displaystyle \sum_K e^{z_k}}
\quad - \quad \frac{e^{z_t}}{\displaystyle\sum_K e^{z_k}}
\frac{e^{z_j}}{\displaystyle\sum_K e^{z_k}}\right]\\[2ex]
&= \frac{-1}{\sigma(t)}\quad\left(\delta_{jt}\sigma(t) - \sigma(t)\sigma(j) \right)\\[2ex]
&= - (\delta_{jt} - \sigma(j))\\[2ex]
&= \sigma(j) - \delta_{jt}
\end{align}$$
This is practically identical to $\frac{\partial E} {\partial z_j} = o_j - t_j$, and it does become identical if instead of focusing on $j$ as an individual output neuron, we transition to vectorial notation (as indicated in your question), and $t_j$ becomes the one-hot encoded vector of true values, which in my notation would be $\small \begin{bmatrix}0&0&0&\cdots&1&0&0&0_K\end{bmatrix}^\top$.
Then, with $\frac{\partial E} {\partial z_j} = o_j - t_j$ we are really calculating the gradient of the loss function with respect to the preactivation of all output neurons: the vector $t_j$ will contain a $1$ only in the neuron corresponding to the correct category, which is equivalent to the delta function $\delta_{jt}$, which is $1$ only when differentiating with respect to the pre-activation of the output neuron of the correct category.
In the Geoffrey Hinton's Coursera ML course the following chunk of code illustrates the implementation in Octave:
%% Compute derivative of cross-entropy loss function.
error_deriv = output_layer_state - expanded_target_batch;
The expanded_target_batch
corresponds to the one-hot encoded sparse matrix with corresponding to the target of the training set. Hence, in the majority of the output neurons, the error_deriv = output_layer_state
$(\sigma(j))$, because $\delta_{jt}$ is $0$, except for the neuron corresponding to the correct classification, in which case, a $1$ is going to be subtracted from $\sigma(j).$
The actual measurement of the cost is carried out with...
% MEASURE LOSS FUNCTION.
CE = -sum(sum(...
expanded_target_batch .* log(output_layer_state + tiny))) / batchsize;
We see again the $\frac{\partial E}{\partial z_j}$ in the beginning of the backpropagation algorithm:
$$\small\frac{\partial E}{\partial W_{hidd-2-out}}=\frac{\partial \text{outer}_{input}}{\partial W_{hidd-2-out}}\, \frac{\partial E}{\partial \text{outer}_{input}}=\frac{\partial z_j}{\partial W_{hidd-2-out}}\, \frac{\partial E}{\partial z_j}$$
in
hid_to_output_weights_gradient = hidden_layer_state * error_deriv';
output_bias_gradient = sum(error_deriv, 2);
since $z_j = \text{outer}_{in}= W_{hidd-2-out} \times \text{hidden}_{out}$
Observation re: OP additional questions:
The splitting of partials in the OP, $\frac{\partial E} {\partial z_j} = {\frac{\partial E} {\partial o_j}}{\frac{\partial o_j} {\partial z_j}}$, seems unwarranted.
The updating of the weights from hidden to output proceeds as...
hid_to_output_weights_delta = ...
momentum .* hid_to_output_weights_delta + ...
hid_to_output_weights_gradient ./ batchsize;
hid_to_output_weights = hid_to_output_weights...
- learning_rate * hid_to_output_weights_delta;
which don't include the output $o_j$ in the OP formula: $w_{ij} = w'_{ij} - r{\frac{\partial E} {\partial z_j}} {o_i}.$
The formula would be more along the lines of...
$$W_{hidd-2-out}:=W_{hidd-2-out}-r\,
\small \frac{\partial E}{\partial W_{hidd-2-out}}\, \Delta_{hidd-2-out}$$
Best Answer
If the gradients are all the same sign, all the weights will either have to increase, or decrease over one iteration. So based on the step length, if you overshoot in the + direction, all weights will have to adjust in the - direction in the next time step. I think the idea he is getting at is similar to what you see in steepest descent (see slide 9 of http://www.robots.ox.ac.uk/~az/lectures/opt/lect1.pdf).