If we know the Cholesky decomposition $V^{-1} = L^TL$, say, then
$$(y - X\beta)^T V^{-1} (y - X\beta) = (Ly - LX\beta)^T (Ly - LX\beta)$$
and we can use standard algorithms (with whatever penalization function one prefers) by replacing the response with the vector $Ly$ and the predictors with the matrix $LX$.
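As a concrete illustration, here is a minimal R sketch of this whitening step; the sample size, the AR(1)-type covariance matrix V, and the data are all invented for the example:

set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1, 0, 0, 1)
V <- 0.5^abs(outer(1:n, 1:n, "-"))         # assumed AR(1)-type error covariance
y <- X %*% beta + t(chol(V)) %*% rnorm(n)  # errors with covariance V
L <- chol(solve(V))   # upper-triangular factor with V^{-1} = t(L) %*% L
yt <- L %*% y         # transformed response Ly
Xt <- L %*% X         # transformed predictors LX

Any standard solver can then be applied to (yt, Xt), e.g. lm.fit(Xt, yt) for ordinary least squares or a penalized method such as glmnet(Xt, yt).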
Finally we were able to produce the same solution with both methods! The first issue is that glmnet solves the lasso problem as stated in the question, but lars uses a slightly different normalization in the objective function: it replaces $\frac{1}{2N}$ by $\frac{1}{2}$, so a lars penalty $\lambda$ corresponds to a glmnet penalty of $\lambda/N$. Second, the two methods normalize the data differently, so the normalization must be switched off when calling them.
To reproduce this, and to verify that lars and glmnet compute the same solutions to the lasso problem, the following lines in the code above must be changed:
la <- lars(X,Y,intercept=TRUE, max.steps=1000, use.Gram=FALSE)
to
la <- lars(X,Y,intercept=TRUE, normalize=FALSE, max.steps=1000, use.Gram=FALSE)
and
glm2 <- glmnet(X,Y,family="gaussian",lambda=0.5*la$lambda,thresh=1e-16)
to
glm2 <- glmnet(X,Y,family="gaussian",lambda=1/nbSamples*la$lambda,standardize=FALSE,thresh=1e-16)
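With those two changes in place, the agreement can be checked directly. The following sketch assumes X, Y and nbSamples are defined as in the code referenced above; the breakpoint index is arbitrary:

library(lars)
library(glmnet)
la <- lars(X, Y, intercept = TRUE, normalize = FALSE, max.steps = 1000, use.Gram = FALSE)
glm2 <- glmnet(X, Y, family = "gaussian", lambda = 1/nbSamples * la$lambda, standardize = FALSE, thresh = 1e-16)
lam <- la$lambda[5]                            # pick one lars breakpoint
b_lars <- coef(la, s = lam, mode = "lambda")   # lars solution at that penalty
b_glmnet <- as.matrix(coef(glm2, s = lam / nbSamples))[-1, 1]  # drop intercept
max(abs(b_lars - b_glmnet))                    # should be very small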
Best Answer
Let's work with the lasso. Recall how a lasso regression model is fitted, given $\lambda$:
$$\min_{\beta\in\mathbb{R}^p}\left\{\frac{1}{N}\|y-X\beta\|_2^2+\lambda\|\beta\|_1\right\}$$
The first term is the mean squared residual, i.e. the squared 2-norm of the residual vector divided by $N$. The second term is the 1-norm of the parameter vector (typically not including the intercept entry $\beta_0$).
There is no reason whatsoever these two components should be comparable in magnitude. Your model could fit very well, yielding small residuals, but need large parameters. Or the other way around. Plus, you may or may not first standardize your predictors, which will change the parameter estimates.
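A tiny invented example makes the point: a model can fit almost perfectly while its coefficient, and hence the penalty term, is huge.

set.seed(3)
n <- 100
x <- rnorm(n)
y <- 1000 * x + rnorm(n)   # near-perfect fit, but a large coefficient
fit <- lm(y ~ x)
mean(resid(fit)^2)         # first term of the objective: about 1
sum(abs(coef(fit)[-1]))    # second term (excluding intercept): about 1000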
This applies to the estimate for $\beta$, given $\lambda$. Now, if you optimize $\lambda$, perhaps using cross-validation, this means that a priori you cannot say anything about the likely range of $\lambda$, other than $\lambda\geq 0$.
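For instance, in the glmnet parameterization $\frac{1}{2N}\|y-X\beta\|_2^2+\lambda\|\beta\|_1$, the smallest $\lambda$ that shrinks all coefficients to zero is $\lambda_{\max}=\max_j|x_j^\top y|/N$ (for centered data without an intercept), which scales with the units of the data. A quick invented check:

set.seed(2)
n <- 50; p <- 3
X <- scale(matrix(rnorm(n * p), n, p))
y <- rnorm(n); y <- y - mean(y)          # centered response, no intercept
max(abs(crossprod(X, y))) / n            # lambda_max for these units
max(abs(crossprod(X, 10 * y))) / n       # rescaling y rescales lambda_max too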
TL;DR: you appear to have misremembered. The optimum $\lambda$ in no way needs to be in some specific interval. Therefore, getting a "surprising" value does not tell you anything about the appropriateness (or not) of your lasso model.
The same of course applies to ridge regression or the elastic net.