Solved – Regularization when you need predictor selection with high collinearity

Tags: lasso, multicollinearity, regularization, ridge regression

I have two questions about regularization:

  1. In the presence of high collinearity, ridge is better than Lasso, but if you need predictor selection, ridge is not what you want. In those cases, should you still use Lasso or is there any alternative (e.g., subset selection)?

  2. Can you use regularization with every predictive method (regression, classification)?

Best Answer

In the presence of high collinearity, ridge is better than Lasso, but if you need predictor selection, ridge is not what you want. In those cases, should you still use Lasso or is there any alternative (e.g., subset selection)?

Yes, there is an alternative that combines the ridge and LASSO penalties, called the elastic net. It minimizes the loss function:

$$ L = \sum_i (y_i - \hat y_i)^2 + \lambda \left( \alpha \sum_j |\beta_j| + \frac{1 - \alpha}{2} \sum_j \beta_j^2 \right) $$

Here, $\lambda$ controls the overall regularization strength, and $\alpha$ is a number between zero and one (inclusive) that adjusts the relative strengths of the ridge vs. LASSO penalization.
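
As a concrete illustration, here is a minimal sketch of an elastic net fit with glmnet in R; the simulated data and the choice of alpha = 0.5 are assumptions for demonstration, not part of the original answer:

```r
# Minimal elastic net sketch with glmnet; data are simulated for illustration.
library(glmnet)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
x[, 2] <- x[, 1] + rnorm(n, sd = 0.01)   # make two predictors highly collinear
y <- x[, 1] + 0.5 * x[, 3] + rnorm(n)

# alpha = 1 is the LASSO, alpha = 0 is ridge; alpha = 0.5 mixes the two.
# cv.glmnet chooses lambda by cross-validation.
fit <- cv.glmnet(x, y, alpha = 0.5)

# Some coefficients are exactly zero, so the fit still performs predictor
# selection while the ridge component stabilizes the collinear pair.
coef(fit, s = "lambda.min")
```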

I do not know of a situation where best subset selection is the appropriate choice, other than those imposed by software or computing-environment constraints.

Can you use regularization with every predictive method (regression, classification)?

Yes. In the context of linear models, the elastic net extends to any generalized linear model: the squared-error term is replaced by the model's negative log-likelihood, with the same penalty added. For example, in logistic regression, the penalized loss is:

$$ L = -\sum_i \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] + \lambda \left( \alpha \sum_j |\beta_j| + \frac{1 - \alpha}{2} \sum_j \beta_j^2 \right) $$

These models are available in the glmnet package in R; the package vignette shows how to use it, and the accompanying paper by Friedman, Hastie, and Tibshirani (2010, Journal of Statistical Software) describes how the fitting works.
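
For the logistic case specifically, a sketch under the same illustrative assumptions (simulated data, alpha = 0.5): passing family = "binomial" to glmnet swaps the squared-error term for the binomial negative log-likelihood shown above.

```r
# Elastic net logistic regression with glmnet; data simulated for illustration.
library(glmnet)

set.seed(2)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
prob <- plogis(x[, 1] - x[, 2])          # true model uses only two predictors
y <- rbinom(n, 1, prob)

fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)
coef(fit, s = "lambda.min")

# Predicted class probabilities for a few rows of new data
predict(fit, newx = x[1:5, ], s = "lambda.min", type = "response")
```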

There are other options as well: multilevel models can be seen as another way to apply regularization, since partial pooling shrinks group-level estimates toward the overall mean. That approach is covered well in the book by Gelman and Hill (Data Analysis Using Regression and Multilevel/Hierarchical Models).
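
To make that connection concrete, here is a minimal sketch (my illustration, not from Gelman and Hill) using lme4 and its built-in sleepstudy data: the fitted per-subject effects are shrunk toward the population average, which is exactly a regularizing effect.

```r
# Partial pooling as regularization: group-level estimates are shrunk toward
# the overall mean, much like a ridge penalty on the group effects.
library(lme4)

fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

# Shrunken per-subject deviations from the population intercept and slope
ranef(fit)$Subject
```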

There is also a multitude of Bayesian approaches; there, the choice of prior can be thought of as a regularization strategy.
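
One standard result makes this concrete (a textbook fact, not stated in the original answer): maximum a posteriori estimation with independent Gaussian priors on the coefficients recovers ridge regression,

$$ \beta_j \sim \mathcal{N}(0, \tau^2) \quad \Rightarrow \quad \hat\beta_{\text{MAP}} = \arg\min_\beta \sum_i (y_i - \hat y_i)^2 + \frac{\sigma^2}{\tau^2} \sum_j \beta_j^2, $$

where $\sigma^2$ is the noise variance, so the prior variance plays the role of $1/\lambda$; Laplace (double-exponential) priors give the LASSO penalty in the same way.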