Using glmnet is really easy once you get the hang of it, thanks to its excellent vignette at http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html (you can also check the CRAN package page).
As for the best lambda for glmnet, the rule of thumb is to use
cvfit <- glmnet::cv.glmnet(x, y)
coef(cvfit, s = "lambda.1se")
instead of lambda.min.
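For a self-contained illustration, here is a sketch with simulated data (the dimensions and seed are arbitrary choices, just to make the snippet runnable):
set.seed(42)
x <- matrix(rnorm(100 * 20), 100, 20)  # 100 observations, 20 predictors
y <- rnorm(100)
cvfit <- glmnet::cv.glmnet(x, y)
coef(cvfit, s = "lambda.1se")  # largest lambda within one SE of the CV minimum
coef(cvfit, s = "lambda.min")  # lambda with minimum CV error, for comparison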
To do the same for lars you have to do it by hand. Here is my solution:
# Cross-validate over the lars steps:
cv <- lars::cv.lars(x, y, plot.it = FALSE, mode = "step")
# First step whose CV error is within one standard error of the minimum:
idx <- which.max(cv$cv - cv$cv.error <= min(cv$cv))
coef(lars::lars(x, y))[idx, ]
Bear in mind that this is not exactly the same, because it stops at a lasso knot (where a variable enters) instead of at an arbitrary point along the path.
Please note that glmnet is now the preferred package; it is actively maintained, more so than lars, and questions about glmnet vs. lars have been answered before (the algorithms they use differ).
As for your question of using the lasso to choose variables and then fitting OLS: it is an ongoing debate. Google for "OLS post Lasso" and you will find papers discussing the topic. Even the authors of Elements of Statistical Learning admit it is possible.
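If you want to try it, here is a sketch of that two-step procedure, assuming x is a numeric predictor matrix and y the response (the variable names are illustrative):
cvfit <- glmnet::cv.glmnet(x, y)
b <- as.matrix(coef(cvfit, s = "lambda.1se"))[-1, 1]  # coefficients, intercept dropped
selected <- which(b != 0)                             # variables the lasso kept
post_ols <- lm(y ~ x[, selected, drop = FALSE])       # OLS refit on the selection
summary(post_ols)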
Edit: Here is the code to reproduce more accurately in lars what glmnet does:
cv <- lars::cv.lars(x, y, plot.it = FALSE)
# Fraction of the final L1 norm picked by the one-standard-error rule:
ideal_l1_ratio <- cv$index[which.max(cv$cv - cv$cv.error <= min(cv$cv))]
obj <- lars::lars(x, y)
# Rescale coefficients to lars's internal standardized scale so the L1 norms
# are comparable with cv$index:
scaled_coefs <- scale(obj$beta, FALSE, 1 / obj$normx)
l1 <- apply(X = scaled_coefs, MARGIN = 1, FUN = function(x) sum(abs(x)))
# First step whose L1 norm exceeds the chosen fraction of the full norm:
coef(obj)[which.max(l1 / tail(l1, 1) > ideal_l1_ratio), ]
The plot_glmnet function in the plotmo package allows more flexibility in the way labels are handled and can handle the issues you mention.
For example, the following code
library(glmnet)
mod <- glmnet(as.matrix(mtcars[-1]), mtcars[,1])
library(plotmo) # for plot_glmnet
plot_glmnet(mod) # default colors
plot_glmnet(mod, label=5) # label the 5 biggest final coefs
g <- "gray"
plot_glmnet(mod, col=c(2,2,g,g,g,g,g,3,3,3)) # specify color of each line
plot_glmnet(mod, col=c(2,2,0,0,0,0,0,3,3,3)) # specify color=0 for no line
gives the following plot: http://www.milbo.org/doc/plot-glmnet-labels.png
Further examples may be found in Chapter 6 of the plotres vignette, which is included in the plotmo package.
Best Answer
I think when trying to interpret these plots of coefficients by $\lambda$, $\log(\lambda)$, or $\sum_i | \beta_i |$, it helps a lot to know how they look in some simple cases. In particular, how they look when your model design matrix is uncorrelated, vs. when there is correlation in your design.
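As a side note, glmnet can draw the same paths against either axis via the xvar argument of its plot method; x and y here stand for any design matrix and response:
fit <- glmnet::glmnet(x, y)
plot(fit, xvar = "lambda")  # coefficient paths against log(lambda)
plot(fit, xvar = "norm")    # the same paths against the L1 norm of the coefficients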
To that end, I created some correlated and uncorrelated data to demonstrate. The data x_uncorr has uncorrelated columns, while x_corr has a pre-set correlation between the columns. Now let's look at the lasso plots for both of these cases. First, the uncorrelated data.
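Here is a minimal sketch of one way to generate such data and draw the first plot; the dimensions, the 0.7 correlation, and the coefficients are just illustrative choices:
library(MASS)    # for mvrnorm
library(glmnet)
set.seed(1)
n <- 1000; p <- 5
x_uncorr <- matrix(rnorm(n * p), nrow = n)       # independent standard normal columns
Sigma <- matrix(0.7, p, p); diag(Sigma) <- 1     # pairwise correlation of 0.7
x_corr <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
beta <- 1:p                                      # distinct true coefficients
y_uncorr <- x_uncorr %*% beta + rnorm(n)
y_corr <- x_corr %*% beta + rnorm(n)
plot(glmnet(x_uncorr, y_uncorr), xvar = "norm")  # lasso paths, uncorrelated design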
A couple of features stand out.
These are all general facts that apply to lasso regression with uncorrelated data, and they can all be either proven by hand (good exercise!) or found in the literature.
Now let's do the correlated data.
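Continuing the sketch above:
plot(glmnet(x_corr, y_corr), xvar = "norm")  # lasso paths, correlated design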
You can read some things off this plot by comparing it to the uncorrelated case.
So now let's look at your plot from the cars dataset and read some interesting things off (I reproduced your plot here so this discussion is easier to read):
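A plot along those lines can be reproduced with the same call used in the earlier answer, with labels turned on:
library(glmnet)
mod <- glmnet(as.matrix(mtcars[-1]), mtcars[, 1])
plot(mod, xvar = "norm", label = TRUE)  # coefficient paths with variable labels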
A word of warning: I wrote the following analysis predicated on the assumption that the curves show the standardized coefficients, but in this example they do not. Non-standardized coefficients are not dimensionless and not comparable, so no conclusions may be drawn from them in terms of predictive importance. For the following analysis to be valid, please pretend that the plot shows the standardized coefficients, and please perform your own analysis on standardized coefficient paths.
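One simple way to obtain standardized paths is to standardize the predictors before fitting (a sketch; glmnet standardizes internally for fitting but reports coefficients on the original scale):
x_std <- scale(as.matrix(mtcars[-1]))                   # center and scale each predictor
plot(glmnet::glmnet(x_std, mtcars[, 1]), label = TRUE)  # paths in comparable units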
- The wt predictor seems very important. It enters the model first, and has a slow and steady descent to its final value. It does have a few correlations that make it a slightly bumpy ride; am in particular seems to have a drastic effect when it enters.

- am is also important. It comes in later, and is correlated with wt, as it affects the slope of wt in a violent way. It is also correlated with carb and qsec, because we do not see the predictable softening of slope when those enter. After these four variables have entered though, we do see the nice uncorrelated pattern, so it seems to be uncorrelated with all the predictors at the end; you can see its entrance reflected in the kinks of the cyl and wt parameters.

- cyl is quite fascinating. It enters second, so is important for small models. After other variables, and especially am, enter, it is not so important anymore, and its trend reverses, eventually being all but removed. It seems like the effect of cyl can be completely captured by the variables that enter at the end of the process. Whether it is more appropriate to use cyl, or the complementary group of variables, really depends on the bias-variance tradeoff: having the group in your final model would increase its variance significantly, but it may be the case that the lower bias makes up for it!

That's a small introduction to how I've learned to read information off of these plots. I think they are tons of fun!
I'd say the cases for wt and am are clear cut: they are important. cyl is much more subtle; it is important in a small model, but not at all relevant in a large one.

I would not be able to make a determination of what to include based only on the figure; that really must be answered in the context of what you are doing. You could say that if you want a three predictor model, then wt, am, and cyl are good choices, as they are relevant in the grand scheme of things and should end up having reasonable effect sizes in a small model. This is predicated on the assumption that you have some external reason to desire a small three predictor model, though.

It's true, this type of analysis looks over the entire spectrum of lambdas and lets you cull relationships over a range of model complexities. That said, for a final model, I think tuning an optimal lambda is very important. In the absence of other constraints, I would definitely use cross validation to find where along this spectrum the most predictive lambda is, and then use that lambda for a final model and a final analysis.
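For instance, with the cars data from the plot above (a sketch):
cvfit <- glmnet::cv.glmnet(as.matrix(mtcars[-1]), mtcars[, 1])
plot(cvfit)        # cross-validated error across the lambda path
cvfit$lambda.min   # the most predictive lambda
cvfit$lambda.1se   # a more regularized choice within one SE of the minimum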
The reason I recommend this has more to do with the right hand side of the graph than the left hand side. For some of the smaller lambdas (towards the right of the plot, where the models are most complex), it could be the case that the model is overfit to the training data. In that case, anything you deduce from the plot in this regime would be a property of the noise in the dataset rather than of the structure of the statistical process itself. Once you have an estimate of the optimal $\lambda$, you have a sense for how much of the plot can be trusted.
In the other direction, sometimes there are outside constraints on how complex a model can be (implementation costs, legacy systems, explanatory minimalism, business interpretability, aesthetic parsimony), and this kind of inspection can really help you understand the shape of your data and the tradeoffs you are making by choosing a smaller-than-optimal model.