Finally we were able to produce the same solution with both methods! The first issue is that glmnet solves the lasso problem as stated in the question, while lars uses a slightly different normalization in the objective function: it replaces $\frac{1}{2N}$ by $\frac{1}{2}$. The second issue is that the two methods normalize the data differently, so normalization must be switched off when calling them.
To reproduce this, and to see that the same lasso solutions can be computed with lars and glmnet, the following lines in the code above must be changed:
la <- lars(X,Y,intercept=TRUE, max.steps=1000, use.Gram=FALSE)
to
la <- lars(X,Y,intercept=TRUE, normalize=FALSE, max.steps=1000, use.Gram=FALSE)
and
glm2 <- glmnet(X,Y,family="gaussian",lambda=0.5*la$lambda,thresh=1e-16)
to
glm2 <- glmnet(X,Y,family="gaussian",lambda=1/nbSamples*la$lambda,standardize=FALSE,thresh=1e-16)
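To see this end to end, here is a minimal sketch on simulated data (the dimensions and coefficients are made up; X, Y, and nbSamples mirror the names in the code above):

# Sketch on simulated data: with normalization switched off in both calls
# and lambda rescaled by 1/nbSamples, lars and glmnet should agree.
library(lars)
library(glmnet)

set.seed(42)
nbSamples <- 100
X <- matrix(rnorm(nbSamples * 5), nbSamples, 5)
Y <- as.vector(X %*% c(2, -1, 0, 0, 1) + rnorm(nbSamples))

la <- lars(X, Y, intercept = TRUE, normalize = FALSE,
           max.steps = 1000, use.Gram = FALSE)
glm2 <- glmnet(X, Y, family = "gaussian",
               lambda = 1 / nbSamples * la$lambda,
               standardize = FALSE, thresh = 1e-16)

# Compare the coefficients at one lambda from the lars grid; the maximum
# absolute difference should be close to zero.
lam <- la$lambda[3]
coef_lars <- predict(la, s = lam, mode = "lambda",
                     type = "coefficients")$coefficients
coef_glmnet <- as.matrix(coef(glm2, s = lam / nbSamples))[-1, 1]  # drop intercept
max(abs(coef_lars - coef_glmnet))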
Here's an unintuitive fact: you're not actually supposed to give glmnet a single value of lambda. From the glmnet documentation:

Do not supply a single value for lambda (for predictions after CV use predict() instead). Supply instead a decreasing sequence of lambda values. glmnet relies on its warm starts for speed, and it's often faster to fit a whole path than to compute a single fit.
cv.glmnet will help you choose lambda, as you alluded to in your examples. The authors of the glmnet package suggest cv$lambda.1se instead of cv$lambda.min, but in practice I've had success with the latter.
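A sketch of that workflow, assuming a numeric design matrix X and response Y as in the code above (the folds are random, hence the seed):

# Sketch: fit the whole lambda path with cross-validation.
library(glmnet)

set.seed(1)  # fold assignment is random, so set a seed for reproducibility
cv <- cv.glmnet(X, Y, family = "gaussian")

cv$lambda.min  # the lambda minimizing cross-validated error
cv$lambda.1se  # the largest lambda within one standard error of that minimum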
After running cv.glmnet, you don't have to rerun glmnet! Every lambda in the grid (cv$lambda) has already been run. This technique is called "Warm Start" and you can read more about it here. Paraphrasing from the introduction, the Warm Start technique reduces the running time of iterative methods by using the solution of a different optimization problem (e.g., glmnet with a larger lambda) as the starting value for a later optimization problem (e.g., glmnet with a smaller lambda).
To extract the desired run from cv$glmnet.fit, try this:
small.lambda.index <- which(cv$lambda == cv$lambda.min)  # position of lambda.min in the grid
small.lambda.betas <- cv$glmnet.fit$beta[, small.lambda.index]  # coefficients at that lambda
Revision (1/28/2017)
No need to hack the glmnet object like I did above; take @alex23lemm's advice below and pass s = "lambda.min", s = "lambda.1se", or some other number (e.g., s = .007) to both coef and predict. Note that your coefficients and predictions depend on this value, which is set by cross-validation. Use a seed for reproducibility! And don't forget that if you don't supply an "s" in coef and predict, you'll be using the default of s = "lambda.1se". I have warmed up to that default after seeing it work better in small-data situations. s = "lambda.1se" also tends to provide more regularization, so if you're working with alpha > 0, it will also tend towards a more parsimonious model. You can also choose a numerical value of s with the help of plot.glmnet to get somewhere in between (just don't forget to exponentiate the values from the x axis!).
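For example, continuing with the cv object fitted above (a sketch; newx just reuses the training matrix for illustration):

# Sketch: query the cv.glmnet object directly at a chosen value of s.
coef(cv, s = "lambda.min")                # coefficients at lambda.min
coef(cv)                                  # default is s = "lambda.1se"
predict(cv, newx = X, s = "lambda.1se")   # predictions at lambda.1se

# Coefficient paths against log(lambda); exponentiate x-axis values
# to turn them back into candidate values of s.
plot(cv$glmnet.fit, xvar = "lambda", label = TRUE)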
Best Answer
glmnet is set up to solve the first formulation, with the L1 penalty, rather than the size constraint. glmnet has a fairly unique method of solving the problem, but is probably the best out there. For most applications, you just feed it your design matrix in the x parameter and the response in the y parameter. The x matrix needs to already have factors expanded into dummies/contrasts. I would usually recommend doing full dummy coding instead of the default treatment coding unless you have a true reference category to shrink towards.

You then need to set alpha to choose the balance between Lasso and Ridge. alpha = 1 will give you the full LASSO. Usually you would want alpha = 0.95 to have a touch of ridging to handle any lingering multicollinearity and just give a bit better conditioning to the problem in general. If you're more focused on predictive accuracy than parsimony, then setting alpha much further towards the ridge side of things is usually better (alpha ~ 0.05).
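A minimal sketch of such a call, on made-up data (all names here are illustrative):

# Sketch: a typical glmnet call with a pre-built numeric design matrix
# (factors already coded as dummies) and a touch of ridging.
library(glmnet)

set.seed(7)
n <- 200
x <- matrix(rnorm(n * 3), n, 3)  # design matrix, factors already coded
y <- 1 + 0.5 * x[, 1] - 0.2 * x[, 2] + rnorm(n)

fit <- glmnet(x = x, y = y, family = "gaussian", alpha = 0.95)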
glmnet will then fit a whole sequence of lambda values. It starts with the smallest lambda that produces an intercept-only model and then gradually lowers lambda to a small fraction of that starting value, making a grid of discrete estimates along the way. You can control how deep the sequence goes and how many stops it makes via the lambda-specific parameters (nlambda and lambda.min.ratio).

Lastly, you'll probably want to use the cv.glmnet function to pick a good value of lambda. The cross-validation will give you a good estimate of the lambda to use when making predictions on new data. Make sure the resulting lambda sequence reaches a minimum; it usually produces a nice visual of the bias-variance tradeoff, but sometimes you'll have to create a longer sequence of lambdas if not much penalty is really needed. Just be sure your custom sequence starts with an intercept-only model and progresses smoothly like the default sequence.

A few minor notes: do be sure you choose the correct distribution for your response; do know that the design matrix can be passed in sparse format if you have some factors with many levels; and read the useful vignettes available on the CRAN glmnet webpage. Good luck!
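For example, continuing the sketch above (the tuning values are illustrative, not recommendations):

# Sketch: control the depth and resolution of the lambda sequence, then
# use cross-validation to pick lambda.
fit_long <- glmnet(x, y, alpha = 0.95,
                   nlambda = 200,            # more stops along the path
                   lambda.min.ratio = 1e-5)  # go deeper than the default

cv_fit <- cv.glmnet(x, y, alpha = 0.95)
plot(cv_fit)       # CV error vs log(lambda); check the curve reaches a minimum
cv_fit$lambda.min  # a good candidate lambda for predicting on new data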
Update - Dummy Coding Contrast snippet:
There is likely a better way, but this should get the job done. This changes the default factor encoding in formulas to be full dummy coding:
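A sketch along those lines (the contr.full helper is an assumed name, not necessarily the original snippet): override R's default contrast functions so that model.matrix() emits one indicator column per factor level.

# Sketch (an assumption, not the author's original code): make factors use
# full dummy coding instead of treatment coding in all formulas.
contr.full <- function(n, ...) {
  # Ignore the requested contrasts and always return the full indicator matrix
  contr.treatment(n, contrasts = FALSE)
}
options(contrasts = c(unordered = "contr.full", ordered = "contr.full"))

# Check: every level of f now gets its own column. With an intercept the
# matrix is rank deficient, which the glmnet penalty tolerates.
df <- data.frame(f = factor(c("a", "b", "c", "a")), z = rnorm(4))
model.matrix(~ f + z, data = df)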