Why and How to Use ReLU in Neural Networks


Why do we use rectified linear units (ReLU) in neural networks? How do they improve neural networks?

Why do we say that ReLU is an activation function? Isn't softmax the activation function for neural networks? I am guessing that we use both ReLU and softmax, like this:

neuron 1 with softmax output --> ReLU on the output of neuron 1, which is the
input of neuron 2 --> neuron 2 with softmax output --> ...

so that the input of neuron 2 is basically ReLU(softmax(x1)). Is this correct?

Best Answer

The ReLU function is $f(x)=\max(0, x)$. Usually it is applied element-wise to the output of some other function, such as a matrix-vector product. In MLPs, rectifier units replace all other activation functions except perhaps at the readout layer. But I suppose you could mix and match them if you'd like.
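
To make the element-wise picture concrete, here is a minimal NumPy sketch of one hidden layer; the weight matrix, bias, and input vector are made-up illustration values, not anything from the question or answer:

```python
import numpy as np

def relu(z):
    # Element-wise max(0, z): negative entries become 0, positive entries pass through.
    return np.maximum(0.0, z)

# A hypothetical hidden layer with 4 units and 3 inputs (example values only).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = np.zeros(4)
x = np.array([0.5, -1.0, 2.0])

pre_activation = W @ x + b      # matrix-vector product
hidden = relu(pre_activation)   # ReLU applied element-wise to that output
print(hidden)
```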

One way ReLUs improve neural networks is by speeding up training. The gradient computation is very simple (either 0 or 1 depending on the sign of $x$). Also, the computational step of a ReLU is easy: any negative elements are set to 0.0 -- no exponentials, no multiplication or division operations.
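
As a sketch of why the backward pass is so cheap (assuming the same element-wise convention as above), the derivative is just an indicator on the sign of the input:

```python
import numpy as np

def relu_grad(z):
    # dReLU/dz: 1 where z > 0, 0 where z < 0 (0 is the usual choice of subgradient at z == 0).
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, -0.1, 0.0, 0.1, 3.0])
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
```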

Gradients of logistic sigmoid and hyperbolic tangent units are smaller than the gradient of the positive portion of the ReLU. This means that weights feeding units operating in the positive portion are updated more rapidly as training progresses. However, this comes at a cost. The 0 gradient on the left-hand side has its own problem, called "dead neurons," in which a gradient update sets the incoming values to a ReLU such that the output is always zero; modified ReLU units such as ELU (or Leaky ReLU, or PReLU, etc.) can ameliorate this.
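
For reference, here is a minimal sketch of two such variants; the slope and scale parameters (`alpha`) are the commonly quoted defaults, assumed for illustration rather than taken from the answer. Both keep a nonzero gradient for negative inputs, so a unit stuck in the negative region can still recover:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Like ReLU, but negative inputs are scaled by a small slope alpha instead of zeroed.
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    # Exponential linear unit: smooth through 0, saturates to -alpha for very negative inputs.
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(z))
print(elu(z))
```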

$\frac{d}{dx}\text{ReLU}(x)=1 \;\forall\; x > 0$. By contrast, the gradient of a sigmoid unit is at most $0.25$; on the other hand, $\tanh$ fares better for inputs in a region near 0, since $0.25 < \frac{d}{dx}\tanh(x) \le 1 \;\forall\; x \in [-1.31, 1.31]$ (approximately).
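
A quick numerical check of those figures (a sketch; the input grid is arbitrary):

```python
import numpy as np

x = np.linspace(-5, 5, 100001)

sigmoid = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sigmoid * (1.0 - sigmoid)   # derivative of the logistic sigmoid
d_tanh = 1.0 - np.tanh(x) ** 2          # derivative of tanh

print(d_sigmoid.max())                  # ~0.25, attained at x = 0
print(d_tanh.max())                     # ~1.0, attained at x = 0
# Edge of the region where tanh's gradient exceeds the sigmoid's maximum of 0.25:
print(np.abs(x[d_tanh > 0.25]).max())   # ~1.317
```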