I understand that ReLUs are generally used in neural nets instead of sigmoid activation functions for the hidden layers. However, many commonly used ReLUs are not differentiable at zero, and gradient descent (stochastic or batch) is quite often used to optimize these networks.
Gradient descent needs the function to be differentiable everywhere it is evaluated. So I'm confused about how ReLUs still work in the context of using gradient descent to find a minimum.
Best Answer
In practice, it's unlikely that a hidden unit receives an input of precisely 0, so it doesn't matter much whether you take 0 or 1 as the gradient in that situation. For example, Theano considers the gradient at 0 to be 0, and TensorFlow's playground does the same. The playground's authors did notice the theoretical issue of non-differentiability, but it works anyway.
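To make the convention concrete, here is a minimal NumPy sketch of ReLU and its (sub)gradient, picking 0 at exactly x = 0 as Theano and TensorFlow's playground do. The function names are my own, not from either library:

```python
import numpy as np

def relu(x):
    """ReLU forward pass: max(x, 0), applied elementwise."""
    return np.maximum(x, 0.0)

def relu_grad(x):
    """Subgradient of ReLU: 1 where x > 0, else 0.
    At exactly x == 0 this picks 0, matching the convention
    described above; picking 1 instead would also work."""
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]
```

Either choice at 0 gives a valid element of the subdifferential [0, 1], which is why training is unaffected in practice.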
As a side note, if you use ReLU, you should watch for dead units in the network (i.e., units that never activate). If you see too many dead units while training your network, you might want to consider switching to leaky ReLU.