Solved – Non-zero-centered activation functions

backpropagation, deep learning, neural networks

I read the following section from cs231n course notes:

  • Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a Neural Network (more on
    this soon) would be receiving data that is not zero-centered. This has
    implications on the dynamics during gradient descent, because if the
    data coming into a neuron is always positive (e.g. $x > 0$
    elementwise in $f = w^Tx + b$), then the gradient on the weights
    $w$ will during backpropagation become either all positive, or
    all negative (depending on the gradient of the whole expression
    $f$). This could introduce undesirable zig-zagging dynamics in the
    gradient updates for the weights. However, notice that once these
    gradients are added up across a batch of data the final update for the
    weights can have variable signs, somewhat mitigating this issue.
    Therefore, this is an inconvenience but it has less severe
    consequences compared to the saturated activation problem above.

I understand why the gradients with respect to the weights $w$ become all positive or all negative during backpropagation, since
$$\dfrac{\partial f}{\partial w_j}=x_j \text{ , and } \dfrac{\partial L}{\partial w_j}=\dfrac{\partial L}{\partial f}\dfrac{\partial f}{\partial w_j}=\dfrac{\partial L}{\partial f}x_j$$
Thus the gradients of $L$ with respect to the weights are all positive or all negative (because every $x_j > 0$), depending on the sign of $\frac{\partial L}{\partial f}$.
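To make this concrete, here is a minimal numerical sketch (my own, not from the question): for a single neuron $f = w^Tx + b$ with all-positive inputs and a squared-error loss chosen purely for illustration, every component of $\frac{\partial L}{\partial w}$ ends up with the sign of $\frac{\partial L}{\partial f}$.

```python
# Sketch: with all-positive inputs x (e.g. sigmoid outputs from the previous
# layer), every component of dL/dw = (dL/df) * x shares the sign of dL/df.
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(0.1, 1.0, size=5)   # all-positive inputs
w = rng.normal(size=5)
b = 0.0

f = w @ x + b
dL_df = 2 * (f - 1.0)               # illustrative loss L = (f - 1)^2
dL_dw = dL_df * x                   # chain rule: dL/dw_j = (dL/df) * x_j

print("dL/df:", dL_df)
print("dL/dw:", dL_dw)              # every entry has the same sign as dL/df
```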

But I do not understand why this has implications on the dynamics during gradient descent. More precisely, why do we get 'zig-zag' gradient updates if the derivatives with respect to the weights are all positive or all negative? Can you provide some intuition and a mathematical argument to justify this?

Best Answer

If the gradients all have the same sign, then in a single iteration all the weights must either increase together or decrease together. If the optimal weight vector requires some weights to increase and others to decrease, no single update can point straight at it; depending on the step length, if you overshoot in the + direction, all the weights have to be adjusted in the - direction at the next step, and so on, so the trajectory zig-zags toward the optimum. I think the idea the author is getting at is similar to what you see in steepest descent (see slide 9 of http://www.robots.ox.ac.uk/~az/lectures/opt/lect1.pdf).
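Here is a toy simulation of that effect (my own construction; the target `w_true`, the learning rate, and the data distribution are all illustrative choices, not from the answer). A single linear neuron with all-positive inputs is trained by per-example SGD toward a weight vector $(1, -1)$, which needs the first weight to grow and the second to shrink. Since each per-example gradient has one sign across both components, every step moves both weights up together or down together, so the step signs must alternate and the path zig-zags rather than heading directly to the optimum.

```python
# Toy zig-zag demo: per-example SGD on a single neuron f = w^T x with
# all-positive inputs. Each update moves both weights in the same direction,
# so progress toward w* = (1, -1) comes from alternating step signs.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0])      # optimum needs w1 to grow and w2 to shrink
w = np.zeros(2)
lr = 0.05

for step in range(2000):
    x = rng.uniform(0.1, 1.0, size=2)    # all-positive inputs
    y = x @ w_true                        # regression target
    f = x @ w
    grad = 2 * (f - y) * x                # both components share one sign
    w -= lr * grad
    if step < 8:
        print(f"step {step}: update sign {np.sign(-grad[0]):+.0f}, w = {w.round(3)}")

print("final w:", w.round(3))             # close to (1, -1) despite the zig-zag path
```

Note that this uses single-example updates on purpose: as the quoted notes point out, once gradients are summed over a batch the total update can have mixed signs, which is why the notes treat this as a mild inconvenience rather than a severe problem.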
