Suppose I'm using dropout, and at test time I decide to do "weight scaling inference" (the method of predicting with the full network, each weight multiplied by $p$, where $p$ is the probability that its origin node is retained).
There's something that's always confused me about this. When a weight grows too large because of dropout training, it's not because its own origin node is being dropped out; it's because the other origin nodes are being dropped out. The weight grows to compensate for the removal of its neighbors. So it seems like the more I drop out its neighbors, the larger it will grow, whereas how often I drop out its own origin node is less relevant to how large it gets. And yet the amount we scale it by to compensate depends on its own $p$ value, not that of its neighbors.
I know that in practice we mostly choose $p$ to be constant across layers, but I'm still interested in whether I'm understanding this correctly.
To give a concrete example, suppose I have two neurons at a layer, where neuron $A$ has probability $p_A = 0.8$ of remaining and neuron $B$ has probability $p_B = 0.05$ of remaining. It seems neuron $A$'s outgoing weights are the ones that will grow much larger than they should be at test time. But I'm still only multiplying $A$'s outgoing weights by $p_A$, not by anything having to do with $p_B$.
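To put numbers on this, here's a quick numpy simulation of the two-neuron example (the activations and outgoing weights are made-up values, just for illustration), comparing the average training-time output over random dropout masks with the weight-scaled test-time output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two neurons with the retention probabilities from the example; the
# activations and outgoing weights are made-up values for illustration.
p_A, p_B = 0.8, 0.05
a_A, a_B = 1.0, 1.0
w_A, w_B = 2.0, 3.0

# Simulate many training-time dropout masks (each neuron kept independently).
n = 1_000_000
keep_A = rng.random(n) < p_A
keep_B = rng.random(n) < p_B
train_outputs = keep_A * w_A * a_A + keep_B * w_B * a_B

# Average training-time output vs. weight-scaled test-time output.
print(train_outputs.mean())                   # ~1.75
print((p_A * w_A) * a_A + (p_B * w_B) * a_B)  # exactly 1.75
```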
So what's going on here?
Best Answer
The dropout paper motivates this procedure as matching the actual output under test conditions with the expected output under training conditions: if a unit is retained with probability $p$ during training, then multiplying its outgoing weights by $p$ at test time ensures that the expected output of that unit (under the distribution used to drop units during training) equals its actual output at test time. This is also why the relevant probability is the origin node's own $p$: a weight transmits its origin node's activation only when that node is kept, so the weight's expected training-time contribution to the next layer is scaled by exactly that $p$, regardless of what happens to its neighbors.
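To spell out the expectation for a single connection: let $w$ be a weight leaving a unit with activation $a$ and retention probability $p$, and let $m \sim \mathrm{Bernoulli}(p)$ be that unit's dropout mask. Then

$$\mathbb{E}[m \, w \, a] = p \, w \, a = (p\,w)\,a,$$

so running the full network with $w$ replaced by $p\,w$ reproduces exactly the expected input each downstream unit saw during training. (The match is exact for each unit's input; for the network's final output it is only approximate, because of the nonlinearities.)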
The paper also explains how this is related to the conceptualization of Dropout as training exponentially many "thinned" networks that share weights. By virtue of scaling by $p$ at test time, the network functions as an ensemble model that approximately averages the outputs of the thinned networks.
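As a sanity check of that ensemble interpretation, here is a small numpy sketch (the network weights and input are random placeholders, purely for illustration): it averages the outputs of many randomly thinned networks and compares that average to a single forward pass through the weight-scaled network. The two agree closely but not exactly, since with nonlinearities the weight-scaled network is only an approximate average of the ensemble.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny one-hidden-layer ReLU network; weights and input are random
# placeholders, assumed purely for illustration.
x  = rng.normal(size=8)
W1 = rng.normal(size=(6, 8)) / np.sqrt(8)
W2 = rng.normal(size=6) / np.sqrt(6)
p  = 0.8                       # retention probability for the input units

def forward(v):
    """Full-network forward pass on input vector v."""
    return W2 @ np.maximum(W1 @ v, 0.0)

# Ensemble view: average the outputs of many thinned networks,
# one random dropout mask on the input units per network.
n = 100_000
masks = rng.random((n, 8)) < p
hidden = np.maximum((masks * x) @ W1.T, 0.0)  # (n, 6): hidden layer per mask
ensemble_avg = (hidden @ W2).mean()

# Weight-scaling inference: scaling the dropped units' outgoing weights
# by p is the same as scaling the units' outputs by p in one pass.
scaled = forward(p * x)

print(ensemble_avg, scaled)  # close but not identical: the average is approximate
```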
References:
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15, 1929–1958.