I'm attempting to predict vegetation productivity from climatic and land-use variables (the latter are categorical). I found a multicollinearity problem among the predictors (especially land use), as seen from the Variance Inflation Factors (VIFs) of an ordinary least squares regression.
Although my knowledge of lasso regression is basic, I assume the lasso might solve the multicollinearity problem and also select the variables that are driving the system. I would appreciate R code for estimating standardized beta coefficients for the predictors, or advice on how to proceed.
| Variable | Coeff. | Std. Coeff. | VIF | Std. Error | t | P Value |
|---|---|---|---|---|---|---|
| Constant | -0.228 | 0 | 0 | 0.086 | -2.644 | 0.008 |
| Precipitation | <.001 | 0.151 | 2.688 | <.001 | 8.541 | 0.0 |
| Solar Rad | 0.002 | 0.343 | 2.836 | <.001 | 18.939 | <.001 |
| Temp | -0.116 | -1.604 | 28.12 | 0.004 | -28.11 | 0.0 |
| Water Stress | 0.881 | 0.391 | 2.352 | 0.037 | 23.7 | <.001 |
| Vapor Pressure | 0.135 | 1.382 | 30.49 | 0.006 | 23.259 | 0.0 |
| 1 | -0.103 | -0.109 | 52.086 | 0.074 | -1.398 | 0.162 |
| 2 | -0.14 | -0.048 | 6.49 | 0.079 | -1.761 | 0.078 |
| 3 | -0.11 | -0.048 | 10.007 | 0.077 | -1.42 | 0.156 |
| 4 | -0.104 | -0.234 | 236.288 | 0.073 | -1.416 | 0.157 |
| 5 | -0.097 | -0.242 | 285.244 | 0.073 | -1.331 | 0.183 |
| 6 | -0.104 | -0.09 | 35.067 | 0.074 | -1.406 | 0.16 |
| 8 | -0.119 | -0.261 | 221.361 | 0.073 | -1.629 | 0.103 |
| ELEVATION | <.001 | -0.115 | 3.917 | <.001 | -5.381 | <.001 |
Condition Number: 59.833
Mean of Correlation Matrix: 0.221
1st Eigenvalue divided by m: 0.328
Best Answer
Be sure to install and load the `glmnet` package. First you need to form a matrix with all your predictors; call that matrix $\mathbf{X}$. I have done this for three variables I created, but since you have more, you will need to change the predictor matrix accordingly.
Then we need a response as well. As you know, this needs to be a vector, so let's form a linear combination of the predictors and corrupt it with some noise.
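As a sketch, here is how such a matrix and response might be built; the three predictors and their coefficients are made up for illustration, so substitute your climate and land-use variables (with land use entered as dummy/indicator columns):

```r
# Hypothetical data: three standardized predictors (replace with your own)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)                         # the predictor matrix

# Response: a linear combination of the predictors, corrupted with noise
y <- 2 * x1 - 1.5 * x2 + 0.5 * x3 + rnorm(n)
```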
We now have everything we need to begin. Here is our first attempt with `glmnet`, which should produce a coefficient-path plot.
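A minimal sketch, assuming the `glmnet` package is installed and `X` and `y` are built as above:

```r
library(glmnet)

# Same made-up data as before
set.seed(1)
n <- 100
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * X[, 1] - 1.5 * X[, 2] + 0.5 * X[, 3] + rnorm(n)

fit <- glmnet(X, y, alpha = 1)             # alpha = 1 selects the lasso penalty
plot(fit, xvar = "lambda", label = TRUE)   # coefficient paths against log(lambda)
```

Note that `glmnet` standardizes the predictors internally by default (`standardize = TRUE`), which is relevant if you want comparable coefficients across predictors on different scales.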
To interpret this plot, recall the optimization problem the lasso solves:
$$\hat{\boldsymbol{\beta}}_{\text{LASSO}} = \arg\min_{\boldsymbol{\beta}} \left\{ \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)^{\prime} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right) + \lambda \sum_j \left|\beta_j \right| \right\}$$
so $\lambda$ is the penalty parameter (or the Lagrange multiplier, if you prefer) and is always nonnegative. Setting $\lambda = 0$ yields the familiar minimization of squared residuals, while for greater values some of the coefficients will be set exactly to zero. As $\lambda \to \infty$, all the coefficients are set to zero.
This is exactly what the plot shows: the coefficient paths for different values of $\lambda$. For reasons beyond me, the creators of this package have opted to present $\lambda$ on the log scale, so values between zero and one appear as negative.
Pick the value of $\lambda$ you like and you can extract the coefficients with `coef()`, passing your chosen $\lambda$ via the `s` argument. To locate that point on the path plot, simply draw a vertical line at $\log(\lambda)$.
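For instance, with a hypothetical choice of $\lambda = 0.05$ (continuing the made-up example above):

```r
library(glmnet)

# Same made-up data and fit as before
set.seed(1)
n <- 100
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * X[, 1] - 1.5 * X[, 2] + 0.5 * X[, 3] + rnorm(n)
fit <- glmnet(X, y, alpha = 1)

coef(fit, s = 0.05)                      # coefficients at lambda = 0.05

plot(fit, xvar = "lambda", label = TRUE)
abline(v = log(0.05), lty = 2)           # mark log(0.05) on the path plot
```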
As you can see, since $\lambda$ is quite close to zero, there is not much shrinkage. Of course, we now have to select one of these values of $\lambda$, and visual inspection is not good enough.
We do this with the cross-validation function of `glmnet`, `cv.glmnet()`. If you have never heard of cross-validation, all you need to know is that it is a predictive criterion: it evaluates out-of-sample performance by splitting the data into training and validation sets, and chooses the value of $\lambda$ for which the prediction error is minimal. I have chosen to plot the cross-validation results just so you can see how this method works. We finally estimate the lasso with the CV-optimal parameter and extract the coefficients. You will notice that in my example the shrinkage is minimal; this is because there is no multicollinearity in the sample, thanks to the naive way it was generated. I am certain your results will be much different.
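A sketch of that last step, again assuming the made-up `X` and `y` from above:

```r
library(glmnet)

# Same made-up data as before
set.seed(1)
n <- 100
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * X[, 1] - 1.5 * X[, 2] + 0.5 * X[, 3] + rnorm(n)

cvfit <- cv.glmnet(X, y, alpha = 1)   # 10-fold cross-validation by default
plot(cvfit)                           # CV error against log(lambda)

cvfit$lambda.min                      # lambda minimizing the CV error
coef(cvfit, s = "lambda.min")         # coefficients at that lambda
```

`cv.glmnet` also reports `lambda.1se`, the largest $\lambda$ whose CV error is within one standard error of the minimum, which gives a sparser model if you prefer it.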
Here is a webpage where the creators explain in detail how to use the package. I kept it simple because you are only interested in the lasso and not in the other estimators.
Hope this helps.