Backpropagation – Why Non Zero-Centered Activation Functions Cause Problems

Tags: backpropagation, deep learning, neural networks

I read the following here:

  • Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a Neural Network (more on
    this soon) would be receiving data that is not zero-centered. This has
    implications on the dynamics during gradient descent, because if the
    data coming into a neuron is always positive (e.g. $x > 0$
    elementwise in $f = w^Tx + b$), then the gradient on the weights
    $w$ will, during backpropagation, become either all positive or
    all negative (depending on the gradient of the whole expression
    $f$). This could introduce undesirable zig-zagging dynamics in the
    gradient updates for the weights. However, notice that once these
    gradients are added up across a batch of data the final update for the
    weights can have variable signs, somewhat mitigating this issue.
    Therefore, this is an inconvenience but it has less severe
    consequences compared to the saturated activation problem above.

Why would having all $x>0$ (elementwise) lead to all-positive or all-negative gradients on $w$?


Best Answer

$$f=\sum w_ix_i+b$$ $$\frac{df}{dw_i}=x_i$$ $$\frac{dL}{dw_i}=\frac{dL}{df}\frac{df}{dw_i}=\frac{dL}{df}x_i$$

Because every $x_i>0$, each component $\dfrac{dL}{dw_i}$ has the same sign as $\dfrac{dL}{df}$, so the gradient on $w$ is either all positive or all negative.
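As a quick numeric check of this chain-rule argument (a minimal NumPy sketch; the particular numbers are made up for illustration), multiplying a strictly positive input vector by any scalar upstream gradient yields components that all share that scalar's sign:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(0.1, 1.0, size=5)  # strictly positive inputs, x_i > 0
dL_df = -0.7                       # some upstream gradient dL/df (its sign is arbitrary)
dL_dw = dL_df * x                  # chain rule: dL/dw_i = (dL/df) * x_i

print(np.sign(dL_dw))              # every component has the sign of dL/df
```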

Update
Say there are two parameters $w_1$ and $w_2$. If the gradient components in both dimensions always share the same sign (i.e., both are positive or both are negative), then each update can only move us roughly toward the northeast or the southwest in the parameter space.

If our goal happens to lie to the northwest, we can only reach it by zig-zagging, much like parallel parking in a narrow space. (Forgive my drawing.)

[Figure: hand-drawn sketch of a zig-zag path of weight updates toward a goal to the northwest]
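Below is a small simulation of this effect (a NumPy sketch with a made-up regression setup, not from the original post): a linear neuron trained by per-example SGD on all-positive inputs, whose optimal weights lie to the northwest of the starting point. Each update is a scalar multiple of the positive input vector, so its two components always share a sign, and the trajectory has to zig-zag toward the goal.

```python
import numpy as np

rng = np.random.default_rng(1)

w_true = np.array([-1.0, 1.0])        # the goal lies to the northwest of the origin
w = np.zeros(2)                        # start at the origin
lr = 0.1

for step in range(10):
    x = rng.uniform(0.1, 1.0, size=2)  # strictly positive inputs, x_i > 0
    y = w_true @ x                     # noiseless target for this example
    f = w @ x                          # linear neuron f = w^T x
    dL_df = f - y                      # upstream gradient of the squared error
    grad = dL_df * x                   # both components share the sign of dL_df
    w -= lr * grad
    print(f"step {step}: grad signs {np.sign(grad)}, w = {w.round(3)}")
```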

Therefore activation functions whose outputs are all positive or all negative (e.g., ReLU, sigmoid) can make gradient-based optimization harder. To address this, we can zero-center the data in advance, or normalize activations inside the network with batch/layer normalization.
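Here is a rough illustration of why zero-centering helps (again a NumPy sketch with synthetic data): once each feature has its mean subtracted, the per-example gradient $\frac{dL}{df}\,x$ can have components of different signs.

```python
import numpy as np

rng = np.random.default_rng(2)

X = rng.uniform(0.1, 1.0, size=(100, 3))   # raw inputs: every feature is positive
X_centered = X - X.mean(axis=0)            # zero-center each feature as a preprocessing step
dL_df = 0.5                                # some upstream gradient, shared for illustration

def mixed_sign_fraction(inputs):
    """Fraction of examples whose gradient dL/dw = (dL/df) * x has mixed-sign components."""
    signs = np.sign(dL_df * inputs)        # shape (n_examples, n_features)
    return np.mean(np.ptp(signs, axis=1) > 0)

print("raw inputs:     ", mixed_sign_fraction(X))          # 0.0 -- signs always agree
print("centered inputs:", mixed_sign_fraction(X_centered)) # > 0 -- signs can now differ
```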

Another solution I can think of is to add a per-input bias term, so the layer becomes $$f=\sum w_i(x_i+b_i).$$ The gradient is then $$\frac{dL}{dw_i}=\frac{dL}{df}(x_i+b_i),$$ whose sign no longer depends solely on $x_i$.
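A tiny sketch of this idea (with hypothetical offset values chosen by hand): a negative $b_i$ can flip the sign of $x_i + b_i$ even though every $x_i$ is positive.

```python
import numpy as np

x = np.array([0.2, 0.9, 0.4])      # all-positive inputs
b = np.array([-0.5, -0.5, -0.5])   # hypothetical per-input offsets (e.g., roughly -mean of the inputs)
dL_df = 1.3                        # some upstream gradient

print(np.sign(dL_df * x))          # without offsets: all gradient components share a sign
print(np.sign(dL_df * (x + b)))    # with offsets: components can carry different signs
```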
