Solved – Non zero centered activation functions

backpropagation, deep learning, neural networks

I read the following section from cs231n course notes:

Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a Neural Network (more on
this soon) would be receiving data that is not zero-centered. This has
implications on the dynamics during gradient descent, because if the
data coming into a neuron is always positive (e.g. $x > 0$
elementwise in $f = w^Tx + b$), then the gradient on the weights
$w$ will during backpropagation become either all positive, or
all negative (depending on the gradient of the whole expression
$f$). This could introduce undesirable zig-zagging dynamics in the
gradient updates for the weights. However, notice that once these
gradients are added up across a batch of data the final update for the
weights can have variable signs, somewhat mitigating this issue.
Therefore, this is an inconvenience but it has less severe
consequences compared to the saturated activation problem above.

I understand why the gradients with respect to the weights $w$ become all positive or all negative during backpropagation, since
$$\dfrac{\partial f}{\partial w_j}=x_j \text{ , and } \dfrac{\partial L}{\partial w_j}=\dfrac{\partial L}{\partial f}\dfrac{\partial f}{\partial w_j}=\dfrac{\partial L}{\partial f}x_j$$
Thus, when every $x_j > 0$, the gradients of $L$ with respect to the weights are all positive or all negative, depending on the sign of $\frac{\partial L}{\partial f}$.

But I do not understand why it has implications on the dynamics during gradient descent. More precisely, why do we get 'zig-zag' gradient updates if the derivatives with respect to weights are all positive or all negative? Can you provide some intuitions and mathematical arguments to justify this?

Best Answer

If the gradients all have the same sign, then in a single iteration all the weights must either increase together or decrease together. So, depending on the step length, if you overshoot in the + direction, all the weights have to adjust in the - direction at the next step. I think the idea he is getting at is similar to the zig-zag path you see in steepest descent (see slide 9 of http://www.robots.ox.ac.uk/~az/lectures/opt/lect1.pdf).
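The argument above can be sketched numerically. This is a minimal illustration under assumptions that are not in the original answer: a single linear neuron $f = w^\top x$ with squared-error loss, two hand-picked all-positive inputs, per-sample updates, and a hypothetical helper named `sgd_path`. The target weights are $(1, -1)$, so the first weight has to grow while the second has to shrink, yet every individual update moves both weights in the same direction.

```python
def sgd_path(samples, targets, w, lr, steps):
    """Per-sample SGD on L = 0.5 * (w.x - y)**2, recording every update."""
    updates = []
    for i in range(steps):
        x = samples[i % len(samples)]
        y = targets[i % len(samples)]
        f = sum(wj * xj for wj, xj in zip(w, x))
        dL_df = f - y                           # scalar upstream gradient
        # dL/dw_j = dL/df * x_j, and x_j > 0, so all components share a sign
        step = [-lr * dL_df * xj for xj in x]
        w = [wj + sj for wj, sj in zip(w, step)]
        updates.append(step)
    return w, updates

# Assumed toy problem: the weights should end up at (1, -1).
w_true = [1.0, -1.0]
samples = [[1.0, 0.2], [0.2, 1.0]]              # all inputs positive
targets = [sum(wt * xj for wt, xj in zip(w_true, x)) for x in samples]

w_final, updates = sgd_path(samples, targets, [0.0, 0.0], lr=0.5, steps=40)
```

In this setup the updates alternate between "both weights up" and "both weights down": the path can only approach $(1, -1)$ by zig-zagging, because a step that raises one weight while lowering the other is never available. Averaging gradients over a batch would mix the signs, which is the mitigation the notes mention.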

Notation:

I'll follow the notation in this made-up example of color classification:

where $j$ is the index denoting any of the $K$ output neurons, not necessarily the one corresponding to the true value ($t$). Now,

$$\begin{align} o_j&=\sigma(j)=\sigma(z_j)=\text{softmax}(j)=\text{softmax (neuron }j)=\frac{e^{z_j}}{\displaystyle\sum_{k=1}^{K} e^{z_k}}\\[3ex] z_j &= \mathbf w_j^\top \mathbf x = \text{preactivation (neuron }j) \end{align}$$

The loss function is the negative log likelihood:

$$E = -\log \sigma(t) = -\log \left(\text{softmax}(t)\right)$$

The negative log likelihood is also known as the multiclass cross-entropy (ref: Pattern Recognition and Machine Learning Section 4.3.4), as they are in fact two different interpretations of the same formula.

Gradient of the loss function with respect to the pre-activation of an output neuron:

$$\begin{align} \frac{\partial E}{\partial z_j}&=\frac{\partial}{\partial z_j}\left(-\log \sigma(t)\right)\\[2ex] &= \frac{-1}{\sigma(t)}\,\frac{\partial}{\partial z_j}\sigma(t)\\[2ex] &= \frac{-1}{\sigma(t)}\,\frac{\partial}{\partial z_j}\sigma(z_t)\\[2ex] &= \frac{-1}{\sigma(t)}\,\frac{\partial}{\partial z_j}\frac{e^{z_t}}{\displaystyle\sum_{k=1}^{K} e^{z_k}}\\[2ex] &= \frac{-1}{\sigma(t)}\left[ \frac{\frac{\partial }{\partial z_j }e^{z_t}}{\displaystyle \sum_{k=1}^{K} e^{z_k}} - \frac{e^{z_t}\, \frac{\partial}{\partial z_j}\displaystyle \sum_{k=1}^{K} e^{z_k}}{\left[\displaystyle\sum_{k=1}^{K} e^{z_k}\right]^2}\right]\\[2ex] &= \frac{-1}{\sigma(t)}\left[ \frac{\delta_{jt}\,e^{z_t}}{\displaystyle \sum_{k=1}^{K} e^{z_k}} - \frac{e^{z_t}}{\displaystyle\sum_{k=1}^{K} e^{z_k}}\, \frac{e^{z_j}}{\displaystyle\sum_{k=1}^{K} e^{z_k}}\right]\\[2ex] &= \frac{-1}{\sigma(t)}\left(\delta_{jt}\sigma(t) - \sigma(t)\sigma(j) \right)\\[2ex] &= - (\delta_{jt} - \sigma(j))\\[2ex] &= \sigma(j) - \delta_{jt} \end{align}$$

This is practically identical to $\frac{\partial E} {\partial z_j} = o_j - t_j$, and it becomes identical if, instead of focusing on $j$ as an individual output neuron, we switch to vector notation (as indicated in your question), so that $t_j$ becomes the one-hot encoded vector of true values, which in my notation would be $\small \begin{bmatrix}0&0&0&\cdots&1&0&0&0_K\end{bmatrix}^\top$.

Then, with $\frac{\partial E} {\partial z_j} = o_j - t_j$ we are really calculating the gradient of the loss function with respect to the preactivation of all output neurons: the vector $t_j$ will contain a $1$ only in the neuron corresponding to the correct category, which is equivalent to the delta function $\delta_{jt}$, which is $1$ only when differentiating with respect to the pre-activation of the output neuron of the correct category.
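The result $\frac{\partial E}{\partial z_j} = \sigma(j) - \delta_{jt}$ can be checked against a finite-difference estimate. The following is a rough sketch rather than code from the course: the function names, the example pre-activations `z`, and the true-class index `t` are all made up for illustration.

```python
import math

def softmax(z):
    """Softmax of a list of pre-activations (shifted by max for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def nll(z, t):
    """Negative log likelihood E = -log softmax(z)[t]."""
    return -math.log(softmax(z)[t])

def analytic_grad(z, t):
    """dE/dz_j = softmax(z)[j] - delta_jt, as derived above."""
    return [s - (1.0 if j == t else 0.0) for j, s in enumerate(softmax(z))]

def numeric_grad(z, t, eps=1e-6):
    """Central finite-difference estimate of dE/dz_j."""
    grad = []
    for j in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        grad.append((nll(z_plus, t) - nll(z_minus, t)) / (2 * eps))
    return grad

z = [1.5, -0.3, 0.8, 0.1]   # assumed pre-activations for K = 4 neurons
t = 2                       # assumed index of the true class

g_analytic = analytic_grad(z, t)
g_numeric = numeric_grad(z, t)
```

The two gradients should agree to within the finite-difference error. Only the component for the true class is negative, since it is the only one from which a $1$ is subtracted.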

In Geoffrey Hinton's Coursera ML course, the following chunk of code illustrates the implementation in Octave:

%% Compute derivative of cross-entropy loss function.
error_deriv = output_layer_state - expanded_target_batch;

The expanded_target_batch is the one-hot encoded sparse matrix of the training-set targets. Hence, for most output neurons, error_deriv equals output_layer_state $(\sigma(j))$, because $\delta_{jt}$ is $0$. The exception is the neuron corresponding to the correct class, where a $1$ is subtracted from $\sigma(j).$

The actual measurement of the cost is carried out with...

% MEASURE LOSS FUNCTION.
CE = -sum(sum(...
  expanded_target_batch .* log(output_layer_state + tiny))) / batchsize;
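For readers who do not use Octave, the same loss measurement might look as follows in plain Python. This is only a sketch: the data layout (one row per example), the function name `cross_entropy`, and the default value of `tiny` are assumptions, since the excerpt above does not show how `tiny` is defined.

```python
import math

def cross_entropy(outputs, targets, tiny=1e-30):
    """Mean cross-entropy over a batch.

    outputs[n][k]: predicted probability of class k for example n.
    targets[n][k]: one-hot target (1.0 for the true class, else 0.0).
    tiny: small constant guarding against log(0); its value is assumed.
    """
    batchsize = len(outputs)
    total = 0.0
    for out_row, tgt_row in zip(outputs, targets):
        for o, t in zip(out_row, tgt_row):
            # Only the true class (t == 1) contributes to the sum.
            total += t * math.log(o + tiny)
    return -total / batchsize

outputs = [[0.5, 0.5], [0.25, 0.75]]   # two examples, two classes
targets = [[1.0, 0.0], [0.0, 1.0]]
ce = cross_entropy(outputs, targets)
```

Because the targets are one-hot, the elementwise product keeps only the log-probability of the correct class for each example, which is the $-\log \sigma(t)$ of the formula above averaged over the batch.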

We see $\frac{\partial E}{\partial z_j}$ again at the start of the backpropagation algorithm:

$$\small\frac{\partial E}{\partial W_{\text{hidd-2-out}}}=\frac{\partial \text{outer}_{input}}{\partial W_{\text{hidd-2-out}}}\, \frac{\partial E}{\partial \text{outer}_{input}}=\frac{\partial z_j}{\partial W_{\text{hidd-2-out}}}\, \frac{\partial E}{\partial z_j}$$

hid_to_output_weights_gradient =  hidden_layer_state * error_deriv';
output_bias_gradient = sum(error_deriv, 2);

since $z_j = \text{outer}_{input}= W_{\text{hidd-2-out}} \times \text{hidden}_{out}$
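The two gradient lines above can be spelled out with explicit loops. This is a sketch under an assumption about shapes inferred from the Octave expressions: `hidden_layer_state` is (hidden units × batch) and `error_deriv` is (output units × batch). The function name and the numbers are invented for illustration.

```python
def output_layer_gradients(hidden_layer_state, error_deriv):
    """Gradients for the hidden-to-output weights and the output biases.

    Mirrors hidden_layer_state * error_deriv' and sum(error_deriv, 2),
    assuming columns index the examples in the batch.
    """
    num_hidden = len(hidden_layer_state)
    num_outputs = len(error_deriv)
    batchsize = len(error_deriv[0])
    # weights_gradient[h][k] = sum over examples n of hidden[h][n] * err[k][n]
    weights_gradient = [
        [sum(hidden_layer_state[h][n] * error_deriv[k][n] for n in range(batchsize))
         for k in range(num_outputs)]
        for h in range(num_hidden)
    ]
    # The bias gradient sums each output unit's error over the batch.
    bias_gradient = [sum(row) for row in error_deriv]
    return weights_gradient, bias_gradient

hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 hidden units, batch of 2
err = [[0.5, -0.5], [-0.5, 0.5]]                # 2 output units, batch of 2
w_grad, b_grad = output_layer_gradients(hidden, err)
```

Each weight gradient is the error at the output unit multiplied by the activation of the hidden unit feeding it, accumulated over the batch. This is the chain rule in the formula above, with $\frac{\partial z_j}{\partial W}$ equal to the hidden activation.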

Observation re: OP additional questions:

The splitting of partials in the OP, $\frac{\partial E} {\partial z_j} = {\frac{\partial E} {\partial o_j}}{\frac{\partial o_j} {\partial z_j}}$, seems unwarranted.

The updating of the weights from hidden to output proceeds as...

hid_to_output_weights_delta = ...
 momentum .* hid_to_output_weights_delta + ...
 hid_to_output_weights_gradient ./ batchsize;
hid_to_output_weights = hid_to_output_weights...
 - learning_rate * hid_to_output_weights_delta;

which doesn't include the output $o_j$ that appears in the OP formula: $w_{ij} = w'_{ij} - r{\frac{\partial E} {\partial z_j}} {o_i}.$ The formula would be more along the lines of...

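$$\Delta_{\text{hidd-2-out}}:=\mu\,\Delta_{\text{hidd-2-out}}+\frac{1}{B}\,\frac{\partial E}{\partial W_{\text{hidd-2-out}}},\qquad W_{\text{hidd-2-out}}:=W_{\text{hidd-2-out}}-r\,\Delta_{\text{hidd-2-out}}$$

where $\mu$ is the momentum, $B$ is the batch size, and $\Delta_{\text{hidd-2-out}}$ is the running update (hid_to_output_weights_delta in the code above).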
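The momentum update in the Octave excerpt above can be sketched in Python as follows. The function name `momentum_step`, the flat-list representation of the weights, and the numeric values are assumptions made for illustration. The arithmetic follows the Octave lines: the running delta accumulates the batch-averaged gradient, and the weights move against the delta.

```python
def momentum_step(weights, delta, gradient, momentum, learning_rate, batchsize):
    """One momentum update, applied elementwise to flat lists of weights."""
    # delta <- momentum * delta + gradient / batchsize
    new_delta = [momentum * d + g / batchsize for d, g in zip(delta, gradient)]
    # weights <- weights - learning_rate * delta
    new_weights = [w - learning_rate * d for w, d in zip(weights, new_delta)]
    return new_weights, new_delta

# Assumed toy values: one weight, a summed gradient of 4.0 over a batch of 2.
w1, d1 = momentum_step([1.0], [0.0], [4.0],
                       momentum=0.9, learning_rate=0.1, batchsize=2)
```

Note that the update uses only the gradient with respect to the weights; no extra factor of the output $o_j$ appears, which is the point made above.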
