Solved – Variance/bias trade-off regularisation penalty – why does it take this form

bias-variance-tradeoff, loss-functions, machine-learning, optimization

In machine learning, if we estimate weights using a loss function

$$L(W) = ||Y-F_W(X)||^2$$

(where $W$ is a weight matrix) we may add a "regularisation penalty" to control for the "variance/bias trade-off" so that

$$L(W) = ||Y-F_W(X)||^2 + \lambda \phi(W).$$

Now, one of the penalties proposed frequently is of the form

$$\phi(W) = \sum_{i,j} |W_{ij}|^2$$

but I cannot find a good explanation of why such a penalty is used, including in the papers where I have come across it.

Why does penalising larger values of $W$ affect the "bias/variance trade-off" of a model? Could someone give me an intuitive explanation for why we would prefer smaller values of $W_{ij}$ and why it's relevant to the bias/variance trade-off?

Best Answer

why such a penalty is used

It is used to penalize the complexity of the model (function), or equivalently to encourage the model to be as simple as possible (roughly speaking, to involve fewer effective weights). This way, over-fitting is avoided and the model's generalization to unseen data is improved.
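As a minimal sketch of how the penalty acts (assuming, for concreteness, a linear model $F_W(X) = XW$, which is not stated in the question), the extra term $\lambda \sum_{i,j} W_{ij}^2$ contributes $2\lambda W$ to the gradient, so every gradient step pulls the weights toward zero:

```python
import numpy as np

# Minimal sketch, assuming F_W(X) = X @ W (a linear model) and the
# L2 penalty lambda * sum_ij W_ij^2 from the question.

def regularized_loss(W, X, Y, lam):
    residual = Y - X @ W
    data_term = np.sum(residual ** 2)   # ||Y - F_W(X)||^2
    penalty = lam * np.sum(W ** 2)      # lambda * sum_ij |W_ij|^2
    return data_term + penalty

def gradient(W, X, Y, lam):
    # Gradient of the data term plus 2*lambda*W from the penalty;
    # the 2*lambda*W term shrinks every weight toward zero at each update.
    return -2 * X.T @ (Y - X @ W) + 2 * lam * W
```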

Why does penalizing larger values of W affect the "bias/variance trade-off" of a model?

By forcing the parameters to be close to zero, the variance of the model, which is a function of its parameters, decreases, since the parameters are less free to vary than when their values are unpenalized. On the other hand, the true function may need a wide range of weights to be estimated correctly, i.e. a more flexible/complex model. By imposing this penalty we allow less flexibility (less complexity), thus moving the model away from the best configuration of weights required to estimate the true function, i.e. more bias.
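A small simulation makes this concrete. The sketch below is an illustration under assumed choices (a degree-9 polynomial model, a sine true function, and the closed-form ridge solution; none of these come from the question): it refits the model on many independent training sets and compares the spread of the predictions (variance) with the systematic error of the average prediction (squared bias) for a small and a large $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)          # assumed true function for the demo

def features(x, degree=9):
    # polynomial basis (an assumed, flexible model class)
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    # closed-form minimizer of ||y - Xw||^2 + lam * ||w||^2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

x_test = np.linspace(0, 1, 50)
X_test = features(x_test)

for lam in [1e-6, 1.0]:
    preds = []
    for _ in range(200):                   # 200 independent training sets
        x_tr = rng.uniform(0, 1, 15)
        y_tr = true_f(x_tr) + rng.normal(0, 0.3, 15)
        w = ridge_fit(features(x_tr), y_tr, lam)
        preds.append(X_test @ w)
    preds = np.array(preds)
    avg_pred = preds.mean(axis=0)
    variance = preds.var(axis=0).mean()                  # spread across training sets
    bias_sq = ((avg_pred - true_f(x_test)) ** 2).mean()  # error of the average fit
    print(f"lambda={lam:g}: variance={variance:.4f}, bias^2={bias_sq:.4f}")
```

With the tiny $\lambda$ the fits swing wildly from one training set to the next (high variance, low bias); with the large $\lambda$ the fits barely move but are systematically pulled away from the true function (low variance, higher bias).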
