Ridge regression for multicollinearity and outliers

machine-learning, multicollinearity, outliers, regression, regularization

I'm wondering about techniques like ridge regression with regard to both multicollinearity and outliers.

My understanding is that ridge regression is used primarily for multicollinearity, but that it is also, somehow, robust to outliers.

If you have a multicollinearity problem, what exactly is it that ridge regression gives you? Isn't the solution to remove the problematic variables? Does ridge regression simply make one of your covariates non-significant and signify that you should remove it?

If you have an outlier problem, what does it mean that ridge regression is robust to outliers? I tried it, and ridge regression actually gave me more outliers (in terms of standardized residual diagnostics) than OLS did.
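To make the comparison I ran concrete, here is a minimal sketch of that kind of diagnostic, using scikit-learn and synthetic data rather than my actual dataset; the penalty strength, the injected outliers, and the crude residual standardization are all illustrative assumptions, not a claim about what ridge should do in general.

```python
# Compare counts of large standardized residuals under OLS vs. ridge
# on synthetic data with a few injected outliers (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, -0.5, 0.0, 2.0]) + rng.normal(size=n)
y[:5] += 8  # inject a handful of outliers

for name, model in [("OLS", LinearRegression()), ("ridge", Ridge(alpha=10.0))]:
    resid = y - model.fit(X, y).predict(X)
    standardized = resid / resid.std(ddof=1)  # crude standardization
    print(name, "points with |standardized residual| > 2:",
          int(np.sum(np.abs(standardized) > 2)))
```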

Best Answer

I understand the argument about dropping variables when they are correlated. After all, if variables are correlated, the information contained in one variable is partially contained in another. Including fewer variables results in fewer parameters in the model, and this can lead to a reduction in variance. Taken together, these points explain the appeal of dropping a variable: retain most of the information but drop a parameter.

The trouble is that, while you decrease the variance, you can introduce bias, perhaps enough that you are worse off, despite the decreased variance.
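As a rough illustration of that trade-off (my own toy simulation, not part of the original argument), the sketch below compares coefficient error when both of two correlated predictors are kept versus when one is dropped; the correlation level, sample size, and true coefficients are arbitrary assumptions.

```python
# Bias/variance trade-off: with two correlated predictors that both matter,
# dropping one reduces variance but can cost more in bias.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 2000
beta = np.array([1.0, 1.0])
err_full, err_drop = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # corr ~ 0.9
    X = np.column_stack([x1, x2])
    y = X @ beta + rng.normal(size=n)
    b_full, *_ = np.linalg.lstsq(X, y, rcond=None)          # keep both predictors
    b_drop, *_ = np.linalg.lstsq(X[:, :1], y, rcond=None)   # drop x2
    err_full.append(np.sum((b_full - beta) ** 2))
    err_drop.append(np.sum((np.r_[b_drop, 0.0] - beta) ** 2))

print("mean squared coefficient error, full model: ", np.mean(err_full))
print("mean squared coefficient error, x2 dropped: ", np.mean(err_drop))
```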

Regularization techniques are an alternative to dropping variables. They introduce bias, yes, but they do so in a way that does not completely drop variables. For instance, LASSO regularization tends to result in many coefficients calculated as exactly $0$. If you treat that as a feature selection step$^{\dagger}$ and rerun your regression on the "surviving" features that have nonzero coefficients, you will get different coefficient estimates than LASSO itself produced, since (among other reasons) the "dead" features still contributed to the LASSO coefficient calculation.
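A minimal sketch of that point, assuming scikit-learn and synthetic collinear data (the penalty value `alpha=0.5` is an arbitrary choice for illustration): LASSO zeroes some coefficients, and refitting OLS on only the surviving features gives coefficients that differ from the LASSO fit.

```python
# LASSO as a feature-selection step, followed by an OLS refit on the survivors.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
n, p = 200, 8
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)   # two highly correlated columns
y = 2 * X[:, 0] + 1 * X[:, 2] + rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
survivors = np.flatnonzero(lasso.coef_ != 0)
print("LASSO coefficients:       ", np.round(lasso.coef_, 2))
print("surviving feature indices:", survivors)

refit = LinearRegression().fit(X[:, survivors], y)
print("OLS refit on survivors:   ", np.round(refit.coef_, 2))
```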

Ridge regression does not set coefficients all the way to zero, yet it seems to have a tendency to outperform LASSO.
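A small sketch contrasting the two penalties on the same kind of synthetic, nearly collinear data (the alpha values are again illustrative assumptions): ridge shrinks the coefficients of the correlated pair toward each other and toward zero, while LASSO tends to zero one of them out entirely.

```python
# Ridge shrinks correlated coefficients; LASSO sets some exactly to zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)            # nearly collinear pair
X = np.column_stack([x1, x2, rng.normal(size=n)])
y = x1 + x2 + rng.normal(size=n)

print("ridge coefficients:", np.round(Ridge(alpha=5.0).fit(X, y).coef_, 3))
print("LASSO coefficients:", np.round(Lasso(alpha=0.3).fit(X, y).coef_, 3))
```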

None of these techniques (ridge, LASSO, or manually dropping variables) is inherently better than the others. In the link, Harrell gives his arguments for ridge regression but concedes that there are situations where LASSO can do better. If you have theoretical knowledge about the process, or a signal in the data screaming at you to drop a variable, then perhaps dropping it would work best.

$^{\dagger}$The link also discusses the issues with using LASSO to select features.
