Solved – Weight decay vs. "model-capacity-reduction" regularization

machine learning, neural networks, regularization

In artificial neural networks, is weight decay the same sort of regularization as reducing the capacity of the model?

I've learned that applying weight decay shrinks the weights towards smaller values, and this tends to avoid overfitting. Okay, I see that. In the figure below it is clear that the overfitted model has regions of large curvature, and this could be a consequence of the weights having large values.

I've learned that reducing the model's capacity, i.e. reducing the number of layers or the number of units per layer, also avoids overfitting (depending also on the complexity of the task), because the more flexible the model, the more it will tend to overfit. This can also be seen in the figure below, where the model that overfits chose that particular solution but could have chosen many others that would still fit the data, yet not the underlying "data generating process" (an expression I don't fully grasp).

So again, my question is: are weight decay and model-capacity reduction achieving the same sort of regularization? Or do they achieve something different? And if so, how?

(Figure taken from www.deeplearningbook.org by Ian Goodfellow et al.)


Best Answer

Weight decay is actually a way of reducing model capacity. Model capacity measures how wide a variety of functions the model can fit. One way to reduce capacity is to constrain the hypothesis space, that is, the set of all possible functions the learning algorithm can produce. In neural nets, this is the set of functions corresponding to every possible choice of parameters. Clearly, removing units and/or layers shrinks this set. I'll explain why weight decay does too.

Weight decay is commonly expressed as a penalty on the squared $\ell_2$ norm of the weights. Say $L(w)$ represents the error on the training set for weights $w$. The learning problem is:

$$\min_w \ L(w) + \lambda \|w\|^2$$
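Here's a minimal sketch of what minimizing this penalized objective looks like with plain gradient descent, assuming a toy least-squares loss (the data `X`, `y` and the names `lam`, `lr`, `n_steps` are illustrative, not from the original post). The penalty contributes $2\lambda w$ to the gradient, so each update multiplicatively shrinks the weights, which is where the name "weight decay" comes from:

```python
import numpy as np

# Toy setup: least-squares loss L(w) = ||Xw - y||^2 / (2n).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.normal(size=100)

def grad_L(w):
    # Gradient of the data-fit term only.
    return X.T @ (X @ w - y) / len(y)

def train(lam, lr=0.1, n_steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        # The penalty adds 2 * lam * w to the gradient, so each step
        # shrinks w by a factor (1 - 2 * lr * lam) before the data-fit
        # update: this multiplicative shrinkage is the "decay".
        w -= lr * (grad_L(w) + 2 * lam * w)
    return w

for lam in [0.0, 0.1, 1.0]:
    w = train(lam)
    print(f"lambda={lam:4.1f}  ||w|| = {np.linalg.norm(w):.3f}")
```

Running this, the final weight norm shrinks as `lam` grows, which previews the constrained view below.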

where the hyperparameter $\lambda$ controls the strength of weight decay. This problem has an equivalent dual formulation, expressed in terms of a constraint rather than a penalty:

$$\min_w \ L(w) \quad \text{s.t. } \|w\| \le c$$

This means that for every choice of $\lambda$, there's a corresponding choice of $c$ such that the two problems have the same solutions. The constraint formulation makes it clear that weight decay is constraining the solution to lie in a ball of radius $c$. Increasing the strength of weight decay (i.e. increasing $\lambda$) corresponds to shrinking this ball (i.e. decreasing $c$).
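The constrained formulation can be solved directly with projected gradient descent: take an ordinary gradient step on $L(w)$, then rescale $w$ back onto the ball whenever $\|w\| > c$. Below is a self-contained sketch using the same toy least-squares setup as above (again, `X`, `y`, `c`, `lr` are illustrative assumptions, not anything from the original post):

```python
import numpy as np

# Toy setup, as in the previous sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = rng.normal(size=100)

def grad_L(w):
    return X.T @ (X @ w - y) / len(y)

def train_constrained(c, lr=0.1, n_steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w -= lr * grad_L(w)        # unconstrained step on L(w)
        norm = np.linalg.norm(w)
        if norm > c:
            w *= c / norm          # project back onto the ball ||w|| <= c
    return w

for c in [10.0, 1.0, 0.1]:
    w = train_constrained(c)
    print(f"c={c:5.1f}  ||w|| = {np.linalg.norm(w):.3f}")
```

Shrinking `c` here plays the same role as increasing `lam` in the penalized version: both force the learned weights into a smaller region of parameter space.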

Because weight decay discards all weight vectors that lie outside this ball, the corresponding functions are no longer part of the hypothesis space. So, weight decay reduces model capacity.
