Solved – Why does lasso yield a higher MSE than ridge?

lasso, prediction, regularization, ridge regression

I fit ridge and lasso regressions on a training data set, choose the lambdas via cross-validation, and evaluate prediction accuracy on a test data set.

After that I repeat the same procedure on the same data, but add polynomials up to the fourth power and pairwise interactions, e.g. (V1*V2) + (V1*V3), to both the training and test data sets.
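A minimal sketch of that procedure, assuming scikit-learn and synthetic data (the data set, sample sizes, and alpha grid are all illustrative, not from the question):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error

# Hypothetical data standing in for the poster's data set
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for name, model in [("ridge", RidgeCV(alphas=np.logspace(-3, 3, 25))),
                    ("lasso", LassoCV(cv=5, max_iter=10000, random_state=0))]:
    # A degree-4 polynomial expansion also generates the pairwise interactions
    pipe = make_pipeline(PolynomialFeatures(degree=4, include_bias=False),
                        StandardScaler(), model)
    pipe.fit(X_tr, y_tr)
    results[name] = mean_squared_error(y_te, pipe.predict(X_te))
    print(name, "test MSE:", round(results[name], 2))
```

(`RidgeCV`/`LassoCV` pick the penalty by cross-validation on the training fold only, so the test MSE stays an honest out-of-sample estimate.)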

In the end, ridge gives a smaller test MSE for the model with interactions and polynomials than for the model without them. That is a result I would expect.

But with lasso I get a higher test MSE than for the model without interactions and polynomials, which is what I did not expect.

Why does lasso perform worse than ridge, and even worse than a model with fewer explanatory variables?

Best Answer

As Frank Harrell notes in an answer to another question, ridge generally performs better than LASSO for prediction. The main exception is when there really are only a handful of true predictors. In the real world, with often-correlated predictors, that is frequently not the case.

Your use of interactions and polynomials probably made this problem worse. Typical LASSO formulations do not try to keep main effects together with their related interaction terms, so it is possible to end up with models that include interaction terms without the associated main effects. That is often not a good idea.

If you also have polynomial modeling of the predictors involved in the interaction terms, then you are starting with a very large set of potential predictors. Ridge regression will include all of those predictors with appropriate penalization of the associated coefficients. LASSO, by its nature, will just choose a few in a way that might not be repeated from sample to sample.
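To make that difference concrete, here is a small illustration on made-up data (the data and penalty values are arbitrary, chosen only to show the contrast): ridge shrinks every coefficient but keeps them all nonzero, while lasso zeroes most of them out.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Hypothetical data: 30 candidate predictors, only 5 truly informative
X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))  # all 30
print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))  # only a few
```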

Repeating LASSO selection of predictors on multiple bootstrap samples of your original data can be instructive. You may find wildly different sets of predictors chosen by LASSO among such samples. Ridge regression will keep all of the predictors, and you may be surprised to find how well the values of the penalized coefficients agree among such samples.
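The bootstrap check described above might be sketched like this, again assuming scikit-learn and synthetic data (resample counts and data dimensions are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
# Hypothetical data: 20 candidate predictors, 3 truly informative
X, y = make_regression(n_samples=150, n_features=20, n_informative=3,
                       noise=20.0, random_state=0)

selected = []
for _ in range(10):
    idx = rng.integers(0, len(y), size=len(y))  # one bootstrap resample
    fit = LassoCV(cv=5, max_iter=10000, random_state=0).fit(X[idx], y[idx])
    # Record which predictors survived selection on this resample
    selected.append(frozenset(np.flatnonzero(fit.coef_ != 0)))

# A perfectly stable selector would produce one distinct set here
print("distinct predictor sets across 10 bootstraps:", len(set(selected)))
```

Fitting the corresponding ridge models on the same resamples and comparing the penalized coefficients across samples makes the stability contrast the answer describes directly visible.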