Solved – How does the rectified linear activation function solve the vanishing gradient problem in neural networks

deep learning, gradient descent, machine learning, neural networks

I have seen the rectified linear unit (ReLU) praised in several places as a solution to the vanishing gradient problem in neural networks. That is, one uses max(0, x) as the activation function. When the activation is positive, it is obvious that this is better than, say, the sigmoid activation function, since its derivative is always 1 instead of an arbitrarily small value for inputs of large magnitude. On the other hand, the derivative is exactly 0 when x is negative. In the worst case, when a unit is never activated, its weights would never change again and the unit would be forever useless, which seems even worse than vanishingly small gradients. How do learning algorithms deal with that problem when they use ReLU?
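To make the two regimes in the question concrete, here is a minimal numpy sketch (the helper names `relu_grad` and `sigmoid_grad` are my own, not from any particular library) comparing the local gradients of the two activations, including the dead-unit case the question worries about:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25, shrinks quickly for large |x|

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for x > 0, exactly 0 for x <= 0

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(x))  # ~[4.5e-05, 0.197, 0.235, 4.5e-05]  -> vanishes for large |x|
print(relu_grad(x))     # [0., 0., 1., 1.]                   -> either 1 or exactly 0

# A unit whose pre-activation is negative for every training input gets
# relu_grad == 0 everywhere, so gradient descent never updates its weights:
# the "dead ReLU" case described in the question.
```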

Best Answer

Here is a paper that explains the issue. I am quoting parts of it to make the issue clear.

The rectifier activation function allows a network to easily obtain sparse representations. For example, after uniform initialization of the weights, around 50% of hidden units' continuous output values are real zeros, and this fraction can easily increase with sparsity-inducing regularization.
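A quick way to see that roughly-50% figure: with weights initialized symmetrically around zero and zero-mean inputs, each pre-activation is negative about half the time, and ReLU turns those into exact zeros. A minimal sketch under those assumptions (the layer sizes and initialization scale below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden, n_samples = 100, 200, 1000
W = rng.uniform(-0.1, 0.1, size=(n_in, n_hidden))  # symmetric uniform init
X = rng.normal(size=(n_samples, n_in))             # zero-mean inputs

H = np.maximum(0.0, X @ W)                         # ReLU hidden layer
print((H == 0).mean())                             # ~0.5: about half the activations are exact zeros
```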

So the rectifier activation function introduces a sparsity effect in the network. Here are some advantages of sparsity, from the same paper:

  • Information disentangling. One of the claimed objectives of deep learning algorithms (Bengio, 2009) is to disentangle the factors explaining the variations in the data. A dense representation is highly entangled because almost any change in the input modifies most of the entries in the representation vector. Instead, if a representation is both sparse and robust to small input changes, the set of non-zero features is almost always roughly conserved by small changes of the input.

  • Efficient variable-size representation. Different inputs may contain different amounts of information and would be more conveniently represented using a variable-size data-structure, which is common in computer representations of information. Varying the number of active neurons allows a model to control the effective dimensionality of the representation for a given input and the required precision.

  • Linear separability. Sparse representations are also more likely to be linearly separable, or more easily separable with less non-linear machinery, simply because the information is represented in a high-dimensional space. Besides, this can reflect the original data format. In text-related applications for instance, the original raw data is already very sparse.

  • Distributed but sparse. Dense distributed representations are the richest representations, being potentially exponentially more efficient than purely local ones (Bengio, 2009). Sparse representations’ efficiency is still exponentially greater, with the power of the exponent being the number of non-zero features. They may represent a good trade-off with respect to the above criteria.

It also answers the question you've asked:

One may hypothesize that the hard saturation at 0 may hurt optimization by blocking gradient back-propagation. To evaluate the potential impact of this effect we also investigate the softplus activation: $ \text{softplus}(x) = \log(1 + e^x) $ (Dugas et al., 2001), a smooth version of the rectifying non-linearity. We lose the exact sparsity, but may hope to gain easier training. However, experimental results tend to contradict that hypothesis, suggesting that hard zeros can actually help supervised training. We hypothesize that the hard non-linearities do not hurt so long as the gradient can propagate along some paths, i.e., that some of the hidden units in each layer are non-zero. With the credit and blame assigned to these ON units rather than distributed more evenly, we hypothesize that optimization is easier.
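For reference, here is a small sketch of the two functions the quoted passage compares. Softplus is strictly positive, so it never produces the exact zeros (and hence the exact sparsity) that the rectifier does:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    # log(1 + e^x), written in a numerically stable form
    return np.logaddexp(0.0, x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(x))      # [0.     0.     0.     1.     5.    ]  -> exact zeros (hard saturation)
print(softplus(x))  # [0.0067 0.3133 0.6931 1.3133 5.0067]  -> always > 0, no exact sparsity
```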

You can read the paper Deep Sparse Rectifier Neural Networks for more detail.