Solved – Why regularize all parameters in the same way

machine-learning, overfitting, regression, regularization

My question relates to regularization in linear regression and logistic regression. I'm currently doing week 3 of Andrew Ng's Machine Learning course on Coursera. I understand how overfitting can be a common problem, and I have some intuition for how regularization can reduce it. My question is: can we improve our models by regularizing different parameters in different ways?


Example:

Let's say we're trying to fit $w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4$. This question is about why we penalize high $w_1$ values in the same way that we penalize high $w_2$ values.

If we know nothing about how our features $(x_1,x_2,x_3,x_4)$ were constructed, it makes sense to treat them all in the same way when we do regularization: a high $w_1$ value should yield as much "penalty" as a high $w_3$ value.

But let's say we have additional information: let's say we only had 2 features originally: $x_1$ and $x_2$. A line was underfitting our training set and we wanted a more squiggly shaped decision boundary, so we constructed $x_3 = x_1^2$ and $x_4 = x_2^3$. Now we can have more complex models, but the more complex they get, the more we risk overfitting our model to the training data. So we want to strike a balance between minimizing the cost function and minimizing our model complexity. Well, the parameters that represent higher exponentials ($x_3$, $x_4$) are drastically increasing the complexity of our model. So shouldn't we penalize more for high $w_3$, $w_4$ values than we penalize for high $w_1,w_2$ values?
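To make the idea concrete, the kind of cost function I have in mind would look something like this, with a separate (hypothetical) penalty weight $\lambda_j$ for each parameter instead of the single shared $\lambda$ from the course:

$$
J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2 + \frac{1}{2m}\sum_{j=1}^{4}\lambda_j\, w_j^2
$$

Choosing $\lambda_3, \lambda_4 > \lambda_1, \lambda_2$ would penalize the higher-order terms more heavily, while setting all $\lambda_j$ equal recovers the usual regularized cost.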

Best Answer

Well, the parameters that represent higher exponentials ($x_3$, $x_4$) are drastically increasing the complexity of our model. So shouldn't we penalize more for high $w_3$, $w_4$ values than we penalize for high $w_1$, $w_2$ values?

The reason we say that adding quadratic or cubic terms increases model complexity is that it leads to a model with more parameters overall. We don't expect a quadratic term to be in and of itself more complex than a linear term. The one thing that's clear is that, all other things being equal, a model with more covariates is more complex.

For the purposes of regularization, one generally rescales all the covariates to have equal mean and variance so that, a priori, they are treated as equally important. If some covariates do in fact have a stronger relationship with the dependent variable than others, then, of course, the regularization procedure won't penalize those covariates as strongly, because they'll have greater contributions to the model fit.
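As a minimal sketch of that rescaling step (assuming scikit-learn is available and using toy data that stands in for the question's features), standardizing the covariates before fitting means a single ridge penalty treats them all on the same footing:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Toy data in the spirit of the question: x1, x2 plus constructed terms x1^2, x2^3
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=100), rng.normal(size=100)
X = np.column_stack([x1, x2, x1**2, x2**3])
y = 1.5 * x1 - 2.0 * x2 + 0.5 * x1**2 + rng.normal(scale=0.1, size=100)

# StandardScaler gives every covariate mean 0 and variance 1, so the single
# ridge penalty (alpha) treats all four columns as a priori equally important.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)
```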

But what if you really do think a priori that one covariate is more important than another, and you can quantify this belief, and you want the model to reflect it? Then what you probably want to do is use a Bayesian model and adjust the priors for the coefficients to match your preexisting belief. Not coincidentally, some familiar regularization procedures can be construed as special cases of Bayesian models. In particular, ridge regression is equivalent to a normal prior on the coefficients, and lasso regression is equivalent to a Laplacian prior.
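If you do want to encode such a belief without fitting a fully Bayesian model, a ridge penalty with per-coefficient weights has a simple closed form, since independent zero-mean normal priors with different variances correspond to different penalty weights. Here is a minimal NumPy sketch (the function name `weighted_ridge` and the particular weights are illustrative assumptions, not a standard API):

```python
import numpy as np

def weighted_ridge(X, y, lam):
    """Minimize ||X w - y||^2 + sum_j lam[j] * w[j]^2.

    This is the MAP estimate under independent zero-mean normal priors
    whose precisions are proportional to the entries of lam.
    """
    # Normal equations of the penalized least-squares problem
    return np.linalg.solve(X.T @ X + np.diag(lam), X.T @ y)

# Illustrative example: shrink the higher-order columns (x1^2, x2^3) harder
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=100), rng.normal(size=100)
X = np.column_stack([x1, x2, x1**2, x2**3])
y = 1.5 * x1 - 2.0 * x2 + rng.normal(scale=0.1, size=100)

lam = np.array([1.0, 1.0, 10.0, 10.0])  # heavier penalty on w3, w4
print(weighted_ridge(X, y, lam))
```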
