Solved – Lasso Regression for Predicting a Continuous Variable + Variable Selection

Tags: glmnet, lasso, multicollinearity, r, regression

I'm attempting to predict vegetation productivity from climatic and land use variables (the latter are categorical). I found that there is a multicollinearity problem among the predictors (especially land use), as seen from the Variance Inflation Factors (VIF) of the ordinary least squares regression.

Although my knowledge of lasso regression is basic, I assume the lasso might address the multicollinearity problem and also select the variables that drive the system. I would appreciate R code for estimating standardized beta coefficients for the predictors, or advice on how to proceed.

    Variable            Coeff.   Std Coeff.   VIF       Std Error   t         P Value
    Constant            -0.228    0             0         0.086      -2.644    0.008
    Precipitation       <.001     0.151         2.688    <.001        8.541    0.0
    Solar Rad            0.002    0.343         2.836    <.001       18.939   <.001
    Temp                -0.116   -1.604        28.12      0.004     -28.11     0.0
    Water Stress         0.881    0.391         2.352     0.037      23.7     <.001
    Vapor Pressure       0.135    1.382        30.49      0.006      23.259    0.0
    Land use level 1    -0.103   -0.109        52.086     0.074      -1.398    0.162
    Land use level 2    -0.14    -0.048         6.49      0.079      -1.761    0.078
    Land use level 3    -0.11    -0.048        10.007     0.077      -1.42     0.156
    Land use level 4    -0.104   -0.234       236.288     0.073      -1.416    0.157
    Land use level 5    -0.097   -0.242       285.244     0.073      -1.331    0.183
    Land use level 6    -0.104   -0.09         35.067     0.074      -1.406    0.16
    Land use level 8    -0.119   -0.261       221.361     0.073      -1.629    0.103
    Elevation           <.001    -0.115         3.917    <.001       -5.381   <.001

    Condition Number: 59.833
    Mean of Correlation Matrix: 0.221
    1st Eigenvalue divided by m: 0.328

Best Answer

Be sure to install and load the glmnet package.

    install.packages("glmnet")
    library(glmnet)

First you need to form a matrix with all your predictors; call that matrix $\mathbf{X}$. I have done this for three variables I created, but since you have more predictors (including categorical ones) you will need to build the matrix accordingly — see the sketch after the example below.

    set.seed(1)
    x1 <- rnorm(30)
    x2 <- rnorm(30)
    x3 <- rnorm(30)
    X <- matrix(c(x1, x2, x3), byrow = FALSE, ncol = 3)  # predictors as the columns of X
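In your case the predictors include categorical land use classes, so you cannot simply bind the raw columns together; glmnet needs a fully numeric matrix. Here is a minimal sketch, assuming purely for illustration that your data sit in a data frame called `veg` with the response in a column `productivity` and the land use classes stored as a factor — adapt the names to your data:

    # model.matrix() expands factor predictors into dummy (indicator) columns,
    # which is what glmnet expects; [, -1] drops the intercept column because
    # glmnet fits its own intercept.
    X <- model.matrix(productivity ~ ., data = veg)[, -1]
    y <- veg$productivity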

Then we need a response as well. This needs to be a numeric vector, so let's form a linear combination of the predictors and corrupt it with some noise.

    y <- 3 + 4*x1 + 3*x2 + 5*x3 + rnorm(30)

We have everything we need now to begin. Here is our first attempt with glmnet.

    fit <- glmnet(x = X, y = y, alpha = 1)  # different values of alpha give different estimators; alpha = 1 is the lasso
    plot(fit, xvar = "lambda")

which should produce

[Plot: lasso coefficient paths as a function of log(lambda)]

To interpret this plot, recall the optimization problem the lasso solves:

$$\hat{\boldsymbol{\beta}}_{LASSO} = \underset{\boldsymbol{\beta}}{\arg\min} \left\{ \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)^{\prime} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right) + \lambda \sum_j \left|\beta_j \right| \right\}$$

so $\lambda$ is the penalty, or the Lagrange multiplier if you prefer, and is always non-negative. Setting $\lambda = 0$ yields the familiar minimization of squared residuals, while for greater values some of the coefficients will be set to zero. As $\lambda \to \infty$, all the coefficients will be set to zero.
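In particular, setting $\lambda = 0$ recovers the ordinary least squares problem:

$$\hat{\boldsymbol{\beta}}_{OLS} = \underset{\boldsymbol{\beta}}{\arg\min} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)^{\prime} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)$$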

This is exactly what the plot shows: the coefficient path for different values of $\lambda$. For reasons beyond me, the creators of this package have opted to present $\lambda$ on the log scale, so values of $\lambda$ between zero and one appear as negative numbers on the horizontal axis.

Pick the value of lambda you like and you can extract the coefficients with the command

    coef(fit, s = 0.3)  # s is the value of lambda

To locate the point on the plot, simply do

    log(0.3)

and as you can see, since this $\lambda$ is quite close to zero, there is not much shrinkage at that point. Of course we now have to select one of these values of $\lambda$, and visual inspection is not good enough.
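To make this concrete, here is a small sketch (the penalty values are arbitrary illustrations, not recommendations) that marks this $\lambda$ on the coefficient path plot and compares it with a heavier penalty:

    plot(fit, xvar = "lambda")        # coefficient paths on the log(lambda) axis
    abline(v = log(0.3), lty = 2)     # mark lambda = 0.3 on that axis
    coef(fit, s = 2)                  # a larger penalty: coefficients are shrunk harder, possibly to exactly zero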

We do this with the cross-validation function of glmnet. If you have never heard of cross-validation, all you need to know is that it evaluates out-of-sample predictive performance by repeatedly splitting the data into training and validation sets, and it chooses the value of $\lambda$ for which the prediction error is minimal.

    crossval <- cv.glmnet(x = X, y = y)    # 10-fold cross-validation by default
    plot(crossval)
    penalty <- crossval$lambda.min         # optimal lambda
    penalty                                # minimal shrinkage here
    fit1 <- glmnet(x = X, y = y, alpha = 1, lambda = penalty)  # estimate the model with that lambda
    coef(fit1)

I have plotted the cross-validation results just so you can see how the method works. We finally estimate the lasso with the CV-optimal penalty and extract the coefficients. You will notice that in my example the shrinkage is minimal; this is because the naively generated predictors have no multicollinearity. I am certain your results will look quite different.
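Since you asked about standardized beta coefficients: glmnet standardizes the predictors internally when computing the path, but it reports the coefficients back on the original scale. One way to obtain coefficients per standard deviation of each predictor is to scale $\mathbf{X}$ before fitting — this is my own suggestion rather than a built-in option, and you would scale $\mathbf{y}$ as well if you want fully standardized betas:

    Xs <- scale(X)                     # centre each column and scale to unit variance
    fit_std <- glmnet(x = Xs, y = y, alpha = 1, lambda = penalty)
    coef(fit_std)                      # coefficients per 1-SD change in each predictor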

The package creators explain in detail how to use glmnet in the package vignette; I kept things simple here because you are only interested in the lasso and not the other estimators.

Hope this helps.
