Solved – Why activation functions that approximate the identity near the origin are preferable

deep learning, machine learning

From Aghdam, H. H., & Heravi, E. J. (2017). Guide to Convolutional Neural Networks:

it is also desirable that the activation function approximates the identity mapping near the origin. To explain this, we should consider the activation of a neuron. Formally, the activation of a neuron is given by $G(wx^T + b)$, where $G$ is the activation function. Usually, the weight vector $w$ and bias $b$ are initialized with values close to zero by the gradient descent method. Consequently, $wx^T + b$ will be close to zero. If $G$ approximates the identity function near zero, its gradient will be approximately equal to its input. In other words, $\delta G \approx wx^T + b \iff wx^T + b \approx 0$. In terms of gradient descent, this is a strong gradient which helps the training algorithm to converge faster.
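To make the quoted claim concrete, here is a small NumPy sketch (my own illustration, not from the book) comparing $\tanh$, which approximates the identity near the origin, with the logistic sigmoid, which does not: for pre-activations $z = wx^T + b$ close to zero, $\tanh(z) \approx z$ and $\tanh'(z) \approx 1$, while $\sigma(0) = 0.5$ and $\sigma'(0) = 0.25$.

```python
import numpy as np

# Pre-activations close to zero, as produced by near-zero w and b
z = np.array([-0.1, -0.01, 0.0, 0.01, 0.1])

tanh = np.tanh(z)                 # tanh approximates the identity near 0
tanh_grad = 1.0 - np.tanh(z)**2   # d/dz tanh(z)

sigmoid = 1.0 / (1.0 + np.exp(-z))        # sigmoid(0) = 0.5, not 0
sigmoid_grad = sigmoid * (1.0 - sigmoid)  # d/dz sigmoid(z), ~0.25 near 0

print("z          :", z)
print("tanh(z)    :", tanh)          # ~ z
print("tanh'(z)   :", tanh_grad)     # ~ 1
print("sigmoid(z) :", sigmoid)       # ~ 0.5, far from the identity
print("sigmoid'(z):", sigmoid_grad)  # ~ 0.25
```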

https://en.wikipedia.org/wiki/Activation_function also quotes them saying:

Approximates identity near the origin: When activation functions have this property, the neural network will learn efficiently when its weights are initialized with small random values. When the activation function does not approximate identity near the origin, special care must be used when initializing the weights.
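As a rough sketch of that remark (my own example; the layer width and the 0.01 init scale are arbitrary assumptions), small random weights keep the pre-activation $z = w^T x + b$ in the region where an identity-approximating activation such as $\tanh$ is almost linear:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in = 256                            # arbitrary input width (assumption)
x = rng.standard_normal(n_in)         # some input vector
w = rng.normal(0.0, 0.01, size=n_in)  # small random init; 0.01 scale is an assumption
b = 0.0

z = w @ x + b                         # pre-activation stays close to zero
print("z            :", z)
print("tanh(z)      :", np.tanh(z))           # ~ z: the unit operates in its linear region
print("|tanh(z) - z|:", abs(np.tanh(z) - z))  # tiny identity-approximation error
```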

but I don't understand why the gradient approximates the input near zero. I think it should be a constant, maybe $w$. Can someone please explain this step by step?

Best Answer

I'll break down the answer into 2 claims:

  1. If the network is initialized with near-zero values ($w, b \approx 0$), then the gradient of each activation with respect to its input will also be approximately zero.
  2. Near-zero gradients are desirable in terms of learning efficiency.

The proof of the second claim is outside the scope of your question. In short, having bounded gradients at each layer is desirable, because gradients that are too small or too large (the "vanishing gradients" and "exploding gradients" problems, respectively) are likely to interfere with the optimization procedure. Much work in the last couple of years has gone into methods for addressing this problem; probably the most famous one is the Batch Normalization scheme.
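To illustrate why a derivative close to 1 at the operating point matters (my own toy sketch, not part of the original answer; it ignores the weight Jacobians and just multiplies activation derivatives across layers), the backpropagated gradient picks up one factor of $G'(z)$ per layer, so values consistently below or above 1 shrink or blow it up geometrically:

```python
import numpy as np

def grad_through_layers(act_deriv_at_zero, n_layers=50):
    """Toy model: each layer multiplies the backpropagated gradient by the
    activation derivative at its (near-zero) pre-activation. Weight Jacobians
    are ignored to isolate the activation's effect."""
    g = 1.0
    for _ in range(n_layers):
        g *= act_deriv_at_zero
    return g

print("tanh    (G'(0) = 1.00):", grad_through_layers(1.0))   # stays 1.0
print("sigmoid (G'(0) = 0.25):", grad_through_layers(0.25))  # vanishes
print("scaled  (G'(0) = 1.50):", grad_through_layers(1.5))   # explodes
```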

We'll prove the first claim. Denote a neuron's input as $x$. Its output is then $G(w^T x)$, where $G$ is our non-linear activation function (the bias $b$ is dropped here; it can be absorbed into $w$). For convenience, denote $z=w^T x$ and $y=G(z)$.

Since $w \approx 0$, we have that

$\frac{\partial z}{\partial x}=\frac{\partial (w^T x)}{\partial x}=w^T \approx 0 $

And since $z = w^T x \approx 0$, if we assume that $G$ is approximately the identity around zero,

$\frac{\partial G}{\partial z} \approx 1$

Plugging these two results into the chain rule:

$\frac{\partial G}{\partial x}= \frac{\partial G}{\partial z} \cdot \frac{\partial z}{\partial x} \approx 1 \cdot w^T = w^T \approx 0$

This also matches your intuition: the gradient with respect to $x$ is essentially the constant $w$ (scaled by $G'(z) \approx 1$), and because $w$ is initialized near zero, it is approximately zero.
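As a sanity check of this result (my own sketch; the choice of $\tanh$ as the identity-approximating $G$, the dimension, and the init scale are assumptions), a central-difference estimate of $\partial G(w^T x)/\partial x$ matches $G'(z)\,w \approx w \approx 0$ when $w$ is small:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5
x = rng.standard_normal(n)
w = rng.normal(0.0, 0.01, size=n)  # near-zero weights, as in the claim

G = np.tanh                        # activation that approximates the identity near 0

def f(v):
    """Neuron output y = G(w^T v)."""
    return G(w @ v)

# Numerical gradient dG/dx via central differences
eps = 1e-6
num_grad = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

z = w @ x
analytic_grad = (1.0 - np.tanh(z)**2) * w  # chain rule: G'(z) * dz/dx

print("numerical  dG/dx:", num_grad)       # ~ w, i.e. ~ 0
print("analytic G'(z)*w:", analytic_grad)
print("w               :", w)
```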