I understand that ReLUs are generally used in neural nets instead of sigmoid activation functions for the hidden layers. However, many commonly used ReLUs are not differentiable at zero, and gradient descent (stochastic or batch) is quite often used to optimize these networks.
Gradient descent needs the function to be differentiable everywhere it is evaluated. So I'm confused about how ReLUs still work in the context of using gradient descent to find a minimum.
Best Answer
In practice, it's unlikely that a hidden unit receives an input of precisely 0, so it doesn't matter much whether you take 0 or 1 as the gradient in that situation. For example, Theano considers the gradient at 0 to be 0, and TensorFlow's playground does the same. The playground's authors did notice the theoretical issue of non-differentiability, but it works anyway.
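To make the convention concrete, here is a minimal NumPy sketch of ReLU and its (sub)gradient, picking 0 at exactly x = 0 as Theano and TensorFlow's playground do. The function names are my own, not from either library:

```python
import numpy as np

def relu(x):
    """ReLU forward pass: max(x, 0), applied elementwise."""
    return np.maximum(x, 0.0)

def relu_grad(x):
    """Subgradient of ReLU: 1 where x > 0, else 0.
    At exactly x == 0 this picks 0, matching the convention
    described above; picking 1 instead would also work."""
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]
```

Either choice at 0 gives a valid element of the subdifferential [0, 1], which is why training is unaffected in practice.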
As a side note, if you use ReLU, you should watch for dead units in the network (i.e., units that never activate). If you see too many dead units while training your network, you might want to consider switching to leaky ReLU.