Solved – Why would ReLU work as an activation function at all?

deep learning, intuition, machine learning, neural networks

When I first started learning neural networks, I tried to build intuition for why they work with sigmoid (logistic) activation functions. I pictured each "neuron" as performing a logistic regression on the layer below, modeling the binomial distribution "Is the feature that this neuron represents present, given the layer below? 1 for yes, 0 for no." Through gradient descent, each neuron converges on a feature that is most useful for the network to recognize.

When moving on to other activation functions, particularly the ReLU, my intuition falls apart: you're no longer doing logistic regression on the layer below, and you're no longer using its output to model a binomial distribution. So what are you really doing? How does a ReLU activation still "recognize" features that are lower in the hierarchy?

Best Answer

Imagine running linear regression in a setting where the outputs should always be non-negative. Whenever the prediction is negative, you clip it to 0 to get a valid output, so effectively $y = \text{relu}(w^T x)$. Now if you "stack" these clipped linear regression units in the same way you "stack" logistic regression units to get a neural network, you end up with a neural network of ReLU units.
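
For concreteness, here's a minimal NumPy sketch of that picture (not part of the original argument; the shapes and the names `W1`, `W2` are just illustrative): one clipped linear regression unit, then the same units stacked into a two-layer forward pass.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input vector

# A single unit: linear regression whose negative predictions are clipped to 0.
w = rng.normal(size=3)
y = relu(w @ x)                 # y = relu(w^T x)

# "Stacking" such units: each row of a weight matrix is one unit,
# and each layer's outputs become the next layer's inputs.
W1 = rng.normal(size=(4, 3))    # first layer: 4 units over the 3 inputs
W2 = rng.normal(size=(2, 4))    # second layer: 2 units over the 4 hidden outputs
hidden = relu(W1 @ x)
output = relu(W2 @ hidden)      # a tiny two-layer ReLU network
print(y, hidden, output)
```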

Another way to see why ReLU works is to drop the idea that sigmoid units are doing logistic regression -- they aren't, in any traditional sense. Instead, the network as a whole acts as a powerful function approximator: it has been shown that a neural network of sufficient size can approximate almost any function arbitrarily well, and training tries to make it approximate the function that maps the inputs to the correct outputs.
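
To make the "function approximator" view concrete, here's a hand-built sketch (an illustration I'm adding, not a claim about how trained networks set their weights): a few ReLU units can be combined into triangular "bumps," and a weighted sum of bumps traces out a smooth curve such as sin(x).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hat(x, center, width):
    # A triangular bump built purely from ReLU units:
    # zero outside [center - width, center + width], peak value 1 at the center.
    return (relu(x - (center - width))
            - 2.0 * relu(x - center)
            + relu(x - (center + width))) / width

# Approximate sin(x) on [0, 2*pi] as a weighted sum of bumps, i.e. a
# one-hidden-layer ReLU network whose weights we set by hand.
xs = np.linspace(0.0, 2.0 * np.pi, 200)
centers = np.linspace(0.0, 2.0 * np.pi, 15)
width = centers[1] - centers[0]
approx = sum(np.sin(c) * hat(xs, c, width) for c in centers)

# Piecewise-linear interpolation error; shrinks as more bumps are used.
print("max |error|:", np.max(np.abs(approx - np.sin(xs))))
```

Adding more (and narrower) bumps makes the piecewise-linear approximation as accurate as you like, which is the intuition behind the universal approximation results mentioned below.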

When you think about a neural network as a function approximator, it makes sense that ReLU works just as well as sigmoid -- both play the role of introducing non-linearities into the network (which is required for the universal approximation theorem to hold).
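
And here's why the non-linearity is the essential ingredient (again a small illustrative check, with arbitrary shapes): without an activation function, stacked layers collapse into a single linear map, so depth adds no expressive power.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=3)

two_linear_layers = W2 @ (W1 @ x)   # "deep" network with no activation
single_layer = (W2 @ W1) @ x        # exactly the same function
print(np.allclose(two_linear_layers, single_layer))  # True
```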

To sum up, you can keep your intuition by replacing logistic regression with a clipped form of linear regression. However, viewing the network as a function approximator may be a better way to understand how neural networks work.
