Solved – Why do SGD and backpropagation work with ReLUs

backpropagation, deep learning, neural networks

ReLUs are not differentiable at the origin. Nevertheless, they are widely used in deep learning together with stochastic gradient descent (SGD) and backpropagation, where the gradients of the loss function are computed via the chain rule.

How do these algorithms compute derivatives given that ReLUs are not differentiable at x = 0?

Best Answer

At x = 0 the ReLU function is not differentiable, but it is sub-differentiable, and any value in the interval [0, 1] is a valid choice of subgradient. Many implementations simply use a subgradient of 0 at the x = 0 singularity. For further details, see the Wikipedia article on the subderivative.
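
To make the choice concrete, here is a minimal NumPy sketch of a ReLU forward pass and its backward-pass subgradient, picking 0 at x = 0 as described above. The function names `relu` and `relu_subgradient` are illustrative, not from any particular library.

```python
import numpy as np

def relu(x):
    # Forward pass: elementwise max(0, x).
    return np.maximum(0.0, x)

def relu_subgradient(x):
    # Backward pass: 1 where x > 0, 0 where x < 0.
    # At x == 0 the subdifferential is the whole interval [0, 1];
    # here we pick 0, the common implementation choice.
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))              # [0. 0. 3.]
print(relu_subgradient(x))  # [0. 0. 1.]
```

Any fixed value in [0, 1] at x = 0 would also be a valid subgradient; in practice the input is exactly 0 so rarely that the choice has negligible effect on training.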