Solved – Does ReLU produce the same effect as dropout?

conv-neural-network, dropout

When we add dropout to a densely connected layer, it randomly ignores nodes by treating their output as zero.

Though we may not observe exactly the same effect in a CNN with ReLU as its activation, isn't what happens somewhat similar to using dropout? When training on images with enough noise, won't random nodes be turned on and off during the training process? The only difference is that this effect also carries over to the model we use after deployment.

Is there a proper way to establish this similarity, or do these two methods solve completely different problems? Also, for a long time I have ignored the fact that ReLU is not actually differentiable at all points.

Best Answer

During training, dropout randomly sets some activations to zero while scaling up the ones that are kept.

ReLU, on the other hand, sets to zero any neuron whose activation is negative.

Notice that, while dropout selects neurons at random, ReLU is deterministic: for the same input and the same CNN weights, ReLU will always behave in the same way. Because of this, the two serve very different purposes in a CNN.
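As a rough illustration of this difference, here is a minimal NumPy sketch (the function names, the 0.5 dropout rate, and the example vector are illustrative assumptions, not code from any particular library): calling ReLU repeatedly on the same activations always returns the same output, while dropout zeroes a different random subset of them on every call.

```python
import numpy as np

def relu(x):
    # Deterministic: negative activations are clipped to zero.
    return np.maximum(x, 0.0)

def dropout(x, rate=0.5, rng=None):
    # Randomly zero a fraction `rate` of the activations and scale the
    # survivors by 1 / (1 - rate) ("inverted dropout"); training time only.
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

a = np.array([-2.0, -0.5, 0.3, 1.7])

print(relu(a))           # always [0.  0.  0.3 1.7]
print(relu(a))           # identical output: ReLU depends only on the input
print(dropout(relu(a)))  # a different random subset is zeroed on each call
print(dropout(relu(a)))
```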

ReLU is used as an activation function, making the response of a neuron non-linear (it is easy to see that ReLU is a non-linear function). If models did not have these non-linearities, they could only compute linear combinations of the input, since a linear combination in the first layer followed by a linear combination in the second is still just a linear combination of the input. Introducing non-linearities is what gives neural networks the ability to learn complex functions. The fact that ReLU is non-differentiable at 0 is not a problem in practice, as we can simply take its derivative there to be either 0 (as for $x<0$) or 1 (as for $x>0$).
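For example, writing out the composition of two linear (affine) layers makes this collapse explicit:

$$
W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2),
$$

which is again just a single affine map of the input $x$. Likewise, the usual convention for the derivative at $0$ can be written as

$$
\frac{d}{dx}\,\operatorname{ReLU}(x) =
\begin{cases}
0 & \text{if } x < 0,\\
1 & \text{if } x > 0,\\
0 \text{ or } 1 \text{ (chosen by convention)} & \text{if } x = 0.
\end{cases}
$$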

Dropout is a regularization method, used to avoid overfitting. It was developed well after multi-layer neural networks were proposed (and those have always used activation functions). There are different intuitions as to why it works, but a simple one is that, to classify a sample, the model must base its decision on multiple neurons rather than just one, since any single neuron may be randomly blocked by dropout.