Solved – Lasso Regression for Predicting a Continuous Variable + Variable Selection

Tags: glmnet, lasso, multicollinearity, r, regression

I'm attempting to predict vegetation productivity from climatic and land use variables (the latter are categorical). I found that there is a multicollinearity problem among the predictors (especially land use), as seen from the Variance Inflation Factors (VIF) of the ordinary least squares regression.

Although my knowledge of lasso regression is basic, I assume the lasso might address the multicollinearity problem and also select the variables that drive the system. I would appreciate R code for estimating standardized beta coefficients for the predictors, or advice on how to proceed.

    Variable            Coeff.   Std Coeff.   VIF       Std Error   t         P Value
    Constant            -0.228    0             0         0.086      -2.644    0.008
    Precipitation       <.001     0.151         2.688    <.001        8.541    0.0
    Solar Rad            0.002    0.343         2.836    <.001       18.939   <.001
    Temp                -0.116   -1.604        28.12      0.004     -28.11     0.0
    Water Stress         0.881    0.391         2.352     0.037      23.7     <.001
    Vapor Pressure       0.135    1.382        30.49      0.006      23.259    0.0
    Land use level 1    -0.103   -0.109        52.086     0.074      -1.398    0.162
    Land use level 2    -0.14    -0.048         6.49      0.079      -1.761    0.078
    Land use level 3    -0.11    -0.048        10.007     0.077      -1.42     0.156
    Land use level 4    -0.104   -0.234       236.288     0.073      -1.416    0.157
    Land use level 5    -0.097   -0.242       285.244     0.073      -1.331    0.183
    Land use level 6    -0.104   -0.09         35.067     0.074      -1.406    0.16
    Land use level 8    -0.119   -0.261       221.361     0.073      -1.629    0.103
    Elevation           <.001    -0.115         3.917    <.001       -5.381   <.001

    Condition Number: 59.833
    Mean of Correlation Matrix: 0.221
    1st Eigenvalue divided by m: 0.328

Best Answer

Be sure to install and load the glmnet package.

    install.packages("glmnet")
    library(glmnet)

First you need to form a matrix with all your predictors; call that matrix $\mathbf{X}$. I have done this for three variables I created, but since you have more predictors (including categorical ones) you will need to build the matrix accordingly — see the sketch after the example below.

    set.seed(1)
    x1 <- rnorm(30)
    x2 <- rnorm(30)
    x3 <- rnorm(30)
    X <- matrix(c(x1, x2, x3), byrow = FALSE, ncol = 3)  # predictors as the columns of X
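In your case the predictors include categorical land use classes, so you cannot simply bind the raw columns together; glmnet needs a fully numeric matrix. Here is a minimal sketch, assuming purely for illustration that your data sit in a data frame called `veg` with the response in a column `productivity` and the land use classes stored as a factor — adapt the names to your data:

    # model.matrix() expands factor predictors into dummy (indicator) columns,
    # which is what glmnet expects; [, -1] drops the intercept column because
    # glmnet fits its own intercept.
    X <- model.matrix(productivity ~ ., data = veg)[, -1]
    y <- veg$productivity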

Then we need a response as well. This needs to be a numeric vector, so let's form a linear combination of the predictors and corrupt it with some noise.

    y <- 3 + 4*x1 + 3*x2 + 5*x3 + rnorm(30)

We have everything we need now to begin. Here is our first attempt with glmnet.

    fit <- glmnet(x = X, y = y, alpha = 1)  # different values of alpha give different estimators; alpha = 1 is the lasso
    plot(fit, xvar = "lambda")

which should produce

[Plot: lasso coefficient paths as a function of log(lambda)]

To interpret this plot, recall the optimization problem the lasso solves:

$$\hat{\boldsymbol{\beta}}_{LASSO} = \underset{\boldsymbol{\beta}}{\arg\min} \left\{ \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)^{\prime} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right) + \lambda \sum_j \left|\beta_j \right| \right\}$$

so $\lambda$ is the penalty, or the Lagrange multiplier if you prefer, and is always non-negative. Setting $\lambda = 0$ yields the familiar minimization of squared residuals, while for greater values some of the coefficients will be set to zero. As $\lambda \to \infty$, all the coefficients will be set to zero.
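In particular, setting $\lambda = 0$ recovers the ordinary least squares problem:

$$\hat{\boldsymbol{\beta}}_{OLS} = \underset{\boldsymbol{\beta}}{\arg\min} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)^{\prime} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)$$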

This is exactly what the plot shows: the coefficient path for different values of $\lambda$. For reasons beyond me, the creators of this package have opted to present $\lambda$ on the log scale, so values of $\lambda$ between zero and one appear as negative numbers on the horizontal axis.

Pick the value of lambda you like and you can extract the coefficients with the command

    coef(fit, s = 0.3)  # s is the value of lambda

To locate the point on the plot, simply do

    log(0.3)

and as you can see, since this $\lambda$ is quite close to zero, there is not much shrinkage at that point. Of course we now have to select one of these values of $\lambda$, and visual inspection is not good enough.
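To make this concrete, here is a small sketch (the penalty values are arbitrary illustrations, not recommendations) that marks this $\lambda$ on the coefficient path plot and compares it with a heavier penalty:

    plot(fit, xvar = "lambda")        # coefficient paths on the log(lambda) axis
    abline(v = log(0.3), lty = 2)     # mark lambda = 0.3 on that axis
    coef(fit, s = 2)                  # a larger penalty: coefficients are shrunk harder, possibly to exactly zero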

We do this with the cross-validation function of glmnet. If you have never heard of cross-validation, all you need to know is that it evaluates out-of-sample predictive performance by repeatedly splitting the data into training and validation sets, and it chooses the value of $\lambda$ for which the prediction error is minimal.

    crossval <- cv.glmnet(x = X, y = y)    # 10-fold cross-validation by default
    plot(crossval)
    penalty <- crossval$lambda.min         # optimal lambda
    penalty                                # minimal shrinkage here
    fit1 <- glmnet(x = X, y = y, alpha = 1, lambda = penalty)  # estimate the model with that lambda
    coef(fit1)

I have plotted the cross-validation results just so you can see how the method works. We finally estimate the lasso with the CV-optimal penalty and extract the coefficients. You will notice that in my example the shrinkage is minimal; this is because the naively generated predictors have no multicollinearity. I am certain your results will look quite different.
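Since you asked about standardized beta coefficients: glmnet standardizes the predictors internally when computing the path, but it reports the coefficients back on the original scale. One way to obtain coefficients per standard deviation of each predictor is to scale $\mathbf{X}$ before fitting — this is my own suggestion rather than a built-in option, and you would scale $\mathbf{y}$ as well if you want fully standardized betas:

    Xs <- scale(X)                     # centre each column and scale to unit variance
    fit_std <- glmnet(x = Xs, y = y, alpha = 1, lambda = penalty)
    coef(fit_std)                      # coefficients per 1-SD change in each predictor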

The package creators explain in detail how to use glmnet in the package vignette; I kept things simple here because you are only interested in the lasso and not the other estimators.

Hope this helps.
