I have seen the rectified linear unit (ReLU) praised in several places as a solution to the vanishing gradient problem in neural networks. That is, one uses max(0, x) as the activation function. When the activation is positive, it is obvious that this is better than, say, the sigmoid activation function, since its derivative is always 1 instead of an arbitrarily small value, which is what the sigmoid's derivative becomes for large |x|. On the other hand, the derivative is exactly 0 when x is smaller than 0. In the worst case, when a unit is never activated, the weights for this unit will never change again, and the unit will be forever useless – which seems much worse than even vanishingly small gradients. How do learning algorithms deal with this problem when they use ReLU?
Solved – How does the rectified linear activation function solve the vanishing gradient problem in neural networks
deep learning, gradient descent, machine learning, neural networks
Best Answer
Here is a paper that explains the issue. I'll quote the relevant parts of it to make the issue clear.
So the rectifier activation function introduces a sparsity effect in the network. Here are some advantages of sparsity, from the same paper:
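As a quick illustration of this sparsity effect (a minimal NumPy sketch, not taken from the paper): if pre-activations are roughly zero-centered, as they tend to be under standard weight initialization, then applying ReLU zeroes out about half of the units.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated pre-activations from a zero-centered distribution,
# as is typical after standard weight initialization.
pre_activations = rng.standard_normal(10_000)

# ReLU: max(0, x) applied elementwise.
relu_out = np.maximum(0.0, pre_activations)

# Fraction of units whose output is exactly zero (the sparsity).
sparsity = np.mean(relu_out == 0.0)
print(f"fraction of exactly-zero activations: {sparsity:.2f}")
```

With a symmetric input distribution, roughly half of the activations come out exactly zero, whereas a sigmoid would leave every unit with a small but nonzero output.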
It also answers the question you've asked:
You can read the paper Deep Sparse Rectifier Neural Networks for more details.
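Regarding the "dead unit" worry in the question: one common mitigation (a standard variant, not something proposed in the answer above) is the leaky ReLU, which keeps a small nonzero slope for negative inputs so that an inactive unit still receives gradient. A minimal sketch of the two gradients:

```python
import numpy as np

def relu_grad(x):
    # Derivative of max(0, x): 1 where x > 0, exactly 0 elsewhere.
    # A unit stuck in the x <= 0 region gets no gradient at all.
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    # Leaky ReLU keeps a small slope alpha for x <= 0,
    # so a "dead" unit can still be pulled back by gradient updates.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(x))        # 0 for the negative inputs, 1 for the positive ones
print(leaky_relu_grad(x))  # alpha (0.01) for the negative inputs, 1 for the positive ones
```

In practice, plain ReLU often works anyway because a unit that is dead for some inputs is usually still active for others in the training set, and careful initialization and learning rates keep most units out of the always-negative regime.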