Solved – Weight scaling inference when using dropout

dropout, regularization

Suppose I'm using dropout, and at test time I decide to do "weight scaling inference" (the method of predicting with the full network, but with each weight multiplied by $p$, the probability that its origin node is retained during training).
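In symbols (my own notation, just to be explicit): a weight $w$ leaving a unit that is retained with probability $p$ during training is used at prediction time, with no units dropped, as

$$w_{\text{test}} = p \, w_{\text{train}}.$$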

There's something that's always confused me about this. When a weight grows too large during dropout training, it's not because its own origin node is being dropped out; it's because the other origin nodes are being dropped out. The weight grows to compensate for the removal of its neighbors. So it seems like the more I drop out its neighbors, the larger it will grow, whereas how often I drop out its own origin node is less relevant to how large it gets. And yet the factor we scale it by to compensate depends on its own node's $p$ value, not its neighbors'.

I know that in practice we mostly choose $p$ to be constant across layers, but still I'm interested to see if I'm understanding it correctly.

To give a concrete example, suppose I have two neurons at a layer, and neuron $A$ has probability $p_A = 0.8$ of being retained while neuron $B$ has probability $p_B = 0.05$. It seems that $A$'s outgoing weights are the ones that will grow much larger than they "should" be at test time, since they have to compensate for $B$ being absent most of the time. And yet I still only multiply $A$'s outgoing weights by $p_A$, not by anything involving $p_B$.

So what's going on here?

Best Answer

The dropout paper motivates this procedure as matching the actual output under test conditions with the expected output under training conditions:

If a unit is retained with probability $p$ during training, the outgoing weights of that unit are multiplied by $p$ at test time as shown in Figure 2. This ensures that for any hidden unit the expected output (under the distribution used to drop units at training time) is the same as the actual output at test time.
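To spell that out (my notation, not the paper's): for a single unit with activation $h$, retention probability $p$, and an outgoing weight $w$, the contribution to the next layer's pre-activation during training is $m\,h\,w$ with mask $m \sim \mathrm{Bernoulli}(p)$, so

$$\mathbb{E}[m\,h\,w] = p\,(h\,w) + (1-p)\cdot 0 = (p\,w)\,h,$$

which is exactly what the full network computes at test time with the weight scaled to $p\,w$. Note that this expectation involves only the unit's own $p$; by linearity of expectation, each neighbor's term in the pre-activation sum is handled separately with its own retention probability.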

The paper also explains how this relates to viewing dropout as training exponentially many "thinned" networks that share weights. By scaling the weights by $p$ at test time, the single full network approximately averages the outputs of this ensemble of thinned networks.
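Here is a minimal NumPy sketch of the expectation-matching argument on the question's two-neuron example (the activations, weights, and variable names are mine, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hidden units A and B from the question, with hypothetical activations/weights.
h = np.array([1.0, 1.0])    # activations h_A, h_B
w = np.array([2.0, 3.0])    # outgoing weights w_A, w_B
p = np.array([0.8, 0.05])   # per-unit retention probabilities p_A, p_B

# Training-time behaviour: sample Bernoulli masks and average the dropped-out
# pre-activation over many samples (Monte Carlo estimate of the expectation).
masks = rng.random((1_000_000, 2)) < p          # each column keeps its unit with prob p
train_outputs = (masks * h * w).sum(axis=1)     # sum_i m_i * h_i * w_i
print("E[train output] ~", train_outputs.mean())

# Test-time behaviour: full network, each outgoing weight scaled by its OWN unit's p.
test_output = (p * w * h).sum()
print("test output     =", test_output)         # matches the Monte Carlo average

# A's expected contribution depends only on p_A, not on p_B:
print("E[A's term] ~", (masks[:, 0] * h[0] * w[0]).mean(),
      " vs  p_A*w_A*h_A =", p[0] * w[0] * h[0])
```

Of course this only matches expectations of the pre-activation; once a nonlinearity is applied the equality is only approximate, which is why the paper describes weight scaling as an approximate average over the thinned networks.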

References:

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(56), 1929–1958.
