Solved – What to do when lasso does not remove correlated variables

elastic netfeature selectionlassomulticollinearityregression

The very essence of lasso is that it is supposed to select only one of two correlated variables.

However, when I include two highly correlated predictors (they are correlated with each other at level ~0.95), both of them are being selected with similar absolute coefficient value (on standardized predictor), but with different signs. This means their effect on prediction almost cancels out, but the coefficients from model on standardized input are highest of all variables.

Example:

          x         coefs 
 (Intercept)        91.6958266
 Population_2013   -49.2656083
 Population_2014    46.8513210

where Variable1 and Variable2 are highly correlated. Other correlated and uncorrelated variables are also included in model. I run models on anything between 20 and 20000 variables and effect is similar for these correlated variables.

Is there any solution? Alternatively – how in any other way can I determine which variables affect significantly my prediction?

Best Answer

The answer turned out to be simple: lambda was low so there was no regularization, therefore lasso did not work as expected. Solution was to manually select lambda instead of relying on lambda minimizing CV error.

Related Solutions

Solved – Using LASSO from lars (or glmnet) package in R for variable selection

Using glmnet is really easy once you get the grasp of it thanks to its excellent vignette in http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html (you can also check the CRAN package page). As for the best lambda for glmnet, the rule of thumb is to use

cvfit <- glmnet::cv.glmnet(x, y)
coef(cvfit, s = "lambda.1se")

instead of lambda.min.

To do the same for lars you have to do it by hand. Here is my solution

cv <- lars::cv.lars(x, y, plot.it = FALSE, mode = "step")
idx <- which.max(cv$cv - cv$cv.error <= min(cv$cv))
coef(lars::lars(x, y))[idx,]

Bear in mind that this is not exactly the same, because this is stopping at a lasso knot (when a variable enters) instead of at any point.

Please note that glmnet is the preferred package now, it is actively maintained, more so than lars, and that there have been questions about glmnet vs lars answered before (algorithms used differ).

As for your question of using lasso to choose variables and then fit OLS, it is an ongoing debate. Google for OLS post Lasso and there are some papers discussing the topic. Even the authors of Elements of Statistical Learning admit it is possible.

Edit: Here is the code to reproduce more accurately what glmnet does in lars

  cv <- lars::cv.lars(x, y, plot.it = FALSE)
  ideal_l1_ratio <- cv$index[which.max(cv$cv - cv$cv.error <= min(cv$cv))]
  obj <- lars::lars(x, y)
  scaled_coefs <- scale(obj$beta, FALSE, 1 / obj$normx)
  l1 <- apply(X = scaled_coefs, MARGIN = 1, FUN = function(x) sum(abs(x)))
  coef(obj)[which.max(l1 / tail(l1, 1) > ideal_l1_ratio),]

Regularization – Why Lasso or ElasticNet Perform Better Than Ridge with Correlated Features

Suppose you have two highly correlated predictor variables $x,z$, and suppose both are centered and scaled (to mean zero, variance one). Then the ridge penalty on the parameter vector is $\beta_1^2 + \beta_2^2$ while the lasso penalty term is $ \mid \beta_1 \mid + \mid \beta_2 \mid$. Now, since the model is supposed highly colinear, so that $x$ and $z$ more or less can substitute each other in predicting $Y$, so many linear combination of $x, z$ where we simply substitute in part $x$ for $z$, will work very similarly as predictors, for example $0.2 x + 0.8 z, 0.3 x + 0.7 z$ or $0.5 x + 0.5 z$ will be about equally good as predictors. Now look at these three examples, the lasso penalty in all three cases are equal, it is 1, while the ridge penalty differ, it is respectively 0.68, 0.58, 0.5, so the ridge penalty will prefer equal weighting of colinear variables while lasso penalty will not be able to choose. This is one reason ridge (or more generally, elastic net, which is a linear combination of lasso and ridge penalties) will work better with colinear predictors: When the data give little reason to choose between different linear combinations of colinear predictors, lasso will just "wander" while ridge tends to choose equal weighting. That last might be a better guess for use with future data! And, if that is so with present data, could show up in cross validation as better results with ridge.

We can view this in a Bayesian way: Ridge and lasso implies different prior information, and the prior information implied by ridge tend to be more reasonable in such situations. (This explanation here I learned , more or less, from the book: "Statistical Learning with Sparsity The Lasso and Generalizations" by Trevor Hastie, Robert Tibshirani and Martin Wainwright, but at this moment I was not able to find a direct quote).

But the OP seems to have a different problem:

However, my results show that the mean absolute error of Lasso or Elastic is around 0.61 whereas this score is 0.97 for the ridge regression

Now, lasso is also effectively doing variable selection, it can set some coefficients exactly to zero. Ridge cannot do that (except with probability zero.) So it might be that with the OP data, among the colinear variables, some are effective and others don't act at all (and the degree of colinearity sufficiently low that this can be detected.) See When should I use lasso vs ridge? where this is discussed. A detailed analysis would need more information than is given in the question.

Best Answer

Related Solutions

Solved – Using LASSO from lars (or glmnet) package in R for variable selection

Regularization – Why Lasso or ElasticNet Perform Better Than Ridge with Correlated Features

Related Question