Solved – What to do when lasso does not remove correlated variables

elastic net, feature selection, lasso, multicollinearity, regression

The very essence of lasso is that it is supposed to select only one of two correlated variables.

However, when I include two highly correlated predictors (their correlation with each other is about 0.95), both are selected, with coefficients of similar absolute value (on standardized predictors) but opposite signs. Their effects on the prediction therefore almost cancel out, yet their coefficients in the model fitted on standardized inputs are the largest of all variables.

Example:

 variable           coefficient
 (Intercept)        91.6958266
 Population_2013   -49.2656083
 Population_2014    46.8513210

where Population_2013 and Population_2014 are highly correlated. Other correlated and uncorrelated variables are also included in the model. I run models on anything between 20 and 20000 variables, and the effect is similar for these correlated variables.

Is there any solution? Alternatively, how else can I determine which variables significantly affect my prediction?

Best Answer

The answer turned out to be simple: lambda was so low that there was effectively no regularization, so lasso did not behave as expected. The solution was to select lambda manually instead of relying on the lambda that minimizes the cross-validation (CV) error.
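To illustrate how lambda drives this behaviour, here is a minimal sketch in plain Python (the post does not say which software was used, so this is not the original setup). Everything in it is an assumption made for illustration: the toy data, the helper names `soft_threshold` and `lasso_cd`, and the objective scaling (1/2n)*RSS + lambda*L1. It fits a lasso by cyclic coordinate descent on two standardized predictors with correlation 0.95 whose true coefficients have opposite signs.

```python
import math

# Toy data (made up): two standardized predictors with correlation rho.
rho = 0.95
u = [1.0, 1.0, -1.0, -1.0]      # mean 0, mean square 1
v = [1.0, -1.0, 1.0, -1.0]      # mean 0, mean square 1, orthogonal to u
x1 = u
x2 = [rho * a + math.sqrt(1.0 - rho ** 2) * b for a, b in zip(u, v)]
X = [x1, x2]                    # list of columns
# Noise-free response whose true coefficients have opposite signs.
y = [1.5 * a - 0.5 * b for a, b in zip(x1, x2)]


def soft_threshold(z, lam):
    """Shrink z towards zero by lam; return exactly 0.0 if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=2000):
    """Cyclic coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1.

    Assumes every column of X has mean 0 and mean square 1, and y has mean 0
    (so no intercept is needed).
    """
    n = len(y)
    beta = [0.0] * len(X)
    resid = list(y)             # residual for the current beta
    for _ in range(n_sweeps):
        for j, col in enumerate(X):
            # Covariance of column j with the partial residual
            # (the residual with column j's own contribution added back).
            z = sum(c * r for c, r in zip(col, resid)) / n + beta[j]
            new = soft_threshold(z, lam)
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta


if __name__ == "__main__":
    for lam in (0.001, 0.01, 0.05, 0.1):
        b1, b2 = lasso_cd(X, y, lam)
        print(f"lambda={lam:<6} x1={b1: .4f} x2={b2: .4f}")
```

On this toy data, a very small lambda keeps both predictors in the model with opposite signs (roughly 1.48 and -0.48 at lambda = 0.001), while lambda = 0.1 sets the second coefficient exactly to zero. The thresholds are specific to these made-up numbers; on real data the useful range of lambda has to be found by inspecting the coefficient path.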
