Backpropagation – Implementing Dropout Backpropagation for Neural Networks

backpropagation, dropout, neural networks

I understand the feedforward part of dropout during training: for each example I multiply each activation by a binary mask, which deactivates each neuron with probability p.

I use the inverted dropout approach, in which I divide all activations that are not zeroed out by (1-p):

    # p: probability of dropping out a unit
    # a: activations of a hidden layer for a mini-batch
    a = a * dropout_mask / (1 - p)

So the effective mask (dropout_mask / (1-p)) is not made of 1s and 0s, but of 0s and 2s when p = 0.5. This way there is no need to scale down the activations at test time.
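For concreteness, here is a minimal NumPy sketch of this inverted-dropout forward pass; the function name `dropout_forward` and the NumPy usage are my own assumptions, not part of the question:

    import numpy as np

    def dropout_forward(a, p, rng):
        # Hypothetical helper: inverted dropout on activations `a` with drop probability `p`.
        # Binary mask: 1 with probability (1 - p), 0 with probability p.
        mask = (rng.random(a.shape) >= p).astype(a.dtype)
        # Scale at training time so nothing needs to be rescaled at test time.
        out = a * mask / (1 - p)
        return out, mask  # keep the mask for the backward pass

    rng = np.random.default_rng(0)
    a = rng.standard_normal((4, 3))              # activations for a mini-batch of 4 examples
    out, mask = dropout_forward(a, p=0.5, rng=rng)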

What I don't understand is how I should compute the gradient during backpropagation. Should I keep the same scaled mask of 0s and 2s, or should it be binary again?

Best Answer

If we rewrite the code as $b = a \cdot mask / (1-p)$, the derivative for backpropagation is $$\frac{\partial b}{\partial a} = \frac{mask}{1-p},$$ which consists of 0s and 2s when $p = 0.5$.

I think it might be more helpful not to see it as a = a * (dropout_mask / (1-p)) (applying the scaling to the mask), but as a = (a * dropout_mask) / (1-p) (applying the scaling to the masked activations).

Then this becomes $c = a \cdot mask$, $b = c / (1-p)$, and by the chain rule $$\frac{\partial b}{\partial a}=\frac{\partial b}{\partial c}\frac{\partial c}{\partial a}=\frac{1}{1-p} \cdot mask,$$ which is the same result, but written this way we can worry less about the exact values stored in the mask. In other words, the backward pass reuses exactly the same mask and scaling that were applied in the forward pass.
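As a sanity check, a matching backward-pass sketch (again assuming the hypothetical `dropout_forward`/`dropout_backward` names and NumPy) simply multiplies the upstream gradient by the same mask and scaling used in the forward pass:

    def dropout_backward(grad_out, mask, p):
        # d(out)/d(a) = mask / (1 - p): 0 for dropped units, 1/(1-p) for kept units.
        return grad_out * mask / (1 - p)

    grad_a = dropout_backward(np.ones_like(out), mask, p=0.5)

Whether you store the binary mask and divide by (1-p) in both passes, or store the already-scaled mask of 0s and 2s, the computed gradient is identical.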
