Finally we were able to produce the same solution with both methods! The first issue is that glmnet solves the lasso problem as stated in the question, while lars uses a slightly different normalization in the objective function: it replaces $\frac{1}{2N}$ by $\frac{1}{2}$. Second, the two methods normalize the data differently, so that normalization must be switched off when calling them.
To reproduce this, and verify that lars and glmnet compute the same lasso solutions, the following lines in the code above must be changed:
la <- lars(X,Y,intercept=TRUE, max.steps=1000, use.Gram=FALSE)
to
la <- lars(X,Y,intercept=TRUE, normalize=FALSE, max.steps=1000, use.Gram=FALSE)
and
glm2 <- glmnet(X,Y,family="gaussian",lambda=0.5*la$lambda,thresh=1e-16)
to
glm2 <- glmnet(X,Y,family="gaussian",lambda=1/nbSamples*la$lambda,standardize=FALSE,thresh=1e-16)
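The effect of that $1/N$ rescaling of $\lambda$ can also be checked directly. As a sketch (in Python with NumPy rather than the R packages above, using the one-predictor case where the lasso has a closed-form soft-threshold solution), minimizing the lars-style objective $\frac{1}{2}\|y-xb\|^2+\lambda|b|$ and the glmnet-style objective $\frac{1}{2N}\|y-xb\|^2+\frac{\lambda}{N}|b|$ gives the same coefficient:

```python
import numpy as np

def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(size=N)
lam = 5.0  # lars-style penalty weight

# Minimizer of (1/2)||y - x b||^2 + lam |b|  (lars-style objective)
b_lars = soft_threshold(x @ y, lam) / (x @ x)

# Minimizer of (1/(2N))||y - x b||^2 + (lam/N) |b|  (glmnet-style objective
# with lambda rescaled by 1/N). Multiplying this objective through by N
# recovers the lars-style objective, so the minimizers must agree.
b_glmnet = soft_threshold((x @ y) / N, lam / N) / ((x @ x) / N)

print(b_lars, b_glmnet)
```

This is exactly why passing 1/nbSamples*la$lambda to glmnet reproduces the lars solution in the R code above.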
For these models, variable importance is based on the regression coefficients of the final model. Larger (absolute) coefficients are associated with larger effects. Using scale = FALSE
is good here so that you also get the signs.
There are always pitfalls with these measures, depending on how you want to measure importance. They don't measure lack of fit at all, so if your model is only 51% accurate, they are not very reflective of the data. In the case of regression coefficients, main effects are misleading when interactions are present, and so on.
As for correlation between predictors, Friedman et al. (2010, JSS) state:
Ridge regression is known to shrink the coefficients of correlated predictors towards each other, allowing them to borrow strength from each other. In the extreme case of $k$ identical predictors, they each get identical coefficients with $1/k^{th}$ the size that any single one would get if fit alone.[...]
Lasso, on the other hand, is somewhat indifferent to very correlated predictors, and will tend to pick one and ignore the rest.
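A quick numerical illustration of that behaviour (a sketch in Python with scikit-learn rather than the R packages discussed here, assuming scikit-learn is available): with two identical copies of a predictor, ridge splits the coefficient evenly between them, while the lasso puts all the weight on one copy and zeroes out the other:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
X = np.column_stack([x, x])          # two identical predictors
y = 3.0 * x + rng.normal(size=n)

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)

# Ridge: correlated coefficients shrink toward each other -- identical
# columns get identical coefficients, each about half the single-column fit.
print(ridge.coef_)

# Lasso: indifferent between exact copies -- it tends to pick one column
# and set the other to zero.
print(lasso.coef_)
```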
We have a pretty good example of that in Section 6.4 of APM.
Max
Best Answer
As far as I understand glmnet, $\alpha=0$ would actually be a ridge penalty and $\alpha=1$ would be a lasso penalty (rather than the other way around), and as far as glmnet is concerned you can fit those end cases. The penalty with $\alpha=0.1$ would be fairly similar to the ridge penalty, but it is not the ridge penalty; if values of $\alpha$ below $0.1$ were not considered, you can't necessarily infer much more than that from the fact that the search stopped at that endpoint. If you know that an $\alpha$ value only slightly larger than $0.1$ was worse, then it is likely that a wider range would have chosen a smaller $\alpha$, but that doesn't suggest it would have been $0$; I expect it would not. If the grid of values is coarse, it may well be that a value larger than $0.1$ would be better.
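To make the convention concrete, here is a sketch in Python with scikit-learn (an analogue, not glmnet itself; sklearn's ElasticNet mixing parameter l1_ratio plays the role of glmnet's $\alpha$): l1_ratio = 1 reproduces the lasso end case, and l1_ratio = 0 reproduces ridge once the difference in penalty scaling between the two implementations is matched.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
a = 0.5

# l1_ratio=1: pure L1 penalty, i.e. the lasso end case.
en_lasso = ElasticNet(alpha=a, l1_ratio=1.0, fit_intercept=False,
                      tol=1e-10, max_iter=100000).fit(X, y)
lasso = Lasso(alpha=a, fit_intercept=False,
              tol=1e-10, max_iter=100000).fit(X, y)

# l1_ratio=0: pure L2 penalty, i.e. the ridge end case. sklearn's
# ElasticNet objective is (1/(2n))||y - Xw||^2 + 0.5*a*||w||^2 here,
# which matches Ridge once its penalty is rescaled by n.
en_ridge = ElasticNet(alpha=a, l1_ratio=0.0, fit_intercept=False,
                      tol=1e-10, max_iter=100000).fit(X, y)
ridge = Ridge(alpha=a * n, fit_intercept=False).fit(X, y)

print(np.max(np.abs(en_lasso.coef_ - lasso.coef_)))
print(np.max(np.abs(en_ridge.coef_ - ridge.coef_)))
```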
[You may want to check whether there was some other reason that $\alpha$ might have ended up at an endpoint; e.g. I seem to recall $\lambda$ got set to an endpoint in forecasting if coefficients for lambdaOpt were not saved.]