I understand the feedforward part of dropout during training: for each example, I multiply the activations by a binary mask that deactivates each neuron with probability p.
I use the inverted approach, in which I also divide the surviving (non-zero) activations by (1-p).
p = probability of dropping out a unit
a = activations of a hidden layer for a mini-batch
a = a * dropout_mask / (1-p)
So the dropout_mask is not made of 1s and 0s, but of 2s and 0s when p=0.5. This way there is no need to scale down the activations at test time.
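As a concrete sketch of this forward pass (assuming NumPy; the shapes and the seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                              # probability of dropping a unit
a = rng.standard_normal((4, 3))      # activations for a mini-batch (hypothetical shape)

# Inverted dropout: fold the 1/(1-p) scaling into the mask, so the kept
# units are scaled up during training and no rescaling is needed at test time.
dropout_mask = (rng.random(a.shape) >= p) / (1 - p)  # entries are 0 or 2 for p=0.5
a_dropped = a * dropout_mask
```

Here `dropout_mask` already contains the 0s and 2s described above, since the division by (1-p) is applied to the binary mask itself.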
What I don't understand is how I should compute the gradient during backpropagation. Should I keep the same mask with 0s and 2s, or should it be binary again?
Best Answer
If we rewrite the code as $b = a \cdot mask / (1-p)$, the derivative for backpropagation is $$\frac{\partial b}{\partial a}=\frac{mask}{1-p},$$ which consists of 0s and 2s for p=0.5.
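In code, this means the backward pass reuses the same scaled mask saved from the forward pass (a NumPy sketch; `grad_b` stands in for an upstream gradient dL/db and is set to ones here just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.5
a = rng.standard_normal((4, 3))

mask = (rng.random(a.shape) >= p) / (1 - p)  # saved from the forward pass: 0s and 2s
b = a * mask                                 # forward: b = a * mask / (1-p), folded into mask

grad_b = np.ones_like(b)   # hypothetical upstream gradient dL/db
grad_a = grad_b * mask     # backward: dL/da = dL/db * mask / (1-p)
```

Dropped units receive zero gradient, and kept units have their gradient scaled by 1/(1-p), exactly matching the forward scaling.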
I think it might be more helpful not to see it as
a = a * (dropout_mask/(1-p))
(applying the scaling to the mask), but as
a = (a*dropout_mask) / (1-p)
(applying the scaling to the masked input). Then this is something like $c = a \cdot mask$, $b = c / (1-p)$, and by the chain rule we have $$\frac{\partial b}{\partial a}=\frac{\partial b}{\partial c}\frac{\partial c}{\partial a}=\frac{1}{1-p}\,mask,$$ which is actually the same, but maybe this way we can worry less about the values in the mask.
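The two-step view is easy to verify numerically with a finite-difference check (a sketch assuming NumPy; the binary `mask` and the vector shape are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.5
a = rng.standard_normal(5)
mask = (rng.random(a.shape) >= p).astype(float)  # binary mask: 0s and 1s

# Two-step view: c = a * mask, then b = c / (1 - p)
def forward(x):
    c = x * mask
    return c / (1 - p)

# Central differences should match the analytic derivative mask / (1 - p)
eps = 1e-6
numeric = (forward(a + eps) - forward(a - eps)) / (2 * eps)
analytic = mask / (1 - p)
```

Since the function is linear in `a`, the numeric and analytic derivatives agree up to floating-point error, confirming that the gradient is the binary mask scaled by 1/(1-p), i.e. the 0s-and-2s mask.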