Solved – R – Lasso Regression – different Lambda per regressor

Tags: glmnet, lars, r, regression

I want to do the following:

1) OLS regression (no penalization term) to get beta coefficients $b_{j}^{*}$, where $j$ indexes the regressors. I do this with

lm.model = lm(y~ 0 + x)
betas    = coefficients(lm.model)

2) Lasso regression with a penalization term; the selection criterion shall be the Bayesian Information Criterion (BIC), given by

$$\lambda _{j} = \frac{\log (T)}{T|b_{j}^{*}|}$$

where $j$ stands for the variable/regressor number, $T$ for the number of observations, and $b_{j}^{*}$ for the initial betas obtained in step 1). I want to have regression results for this specific $\lambda_j$ value, which is different for each regressor used. Hence if there are three variables, there will be three different values $\lambda_j$.
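For example (a minimal sketch; the names T.obs and lambdas are purely illustrative), these values can be computed in R from the betas of step 1):

T.obs   = length(y)                          # T, the number of observations
lambdas = log(T.obs) / (T.obs * abs(betas))  # one lambda per regressor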

The OLS-Lasso optimization problem is then given by

$$\min_{b\in \mathbb{R}^{m}} \left\{ \sum_{t=1}^{T}(y_{t}-b^{\top} X_{t} )^{2} + T\sum_{j=1}^{m} \lambda_{j}|b_{j}| \right\}$$

How can I do this in R with either the lars or glmnet package? I cannot find a way to specify lambda and I am not 100% sure if I get the correct results if I run

lars.model <- lars(x,y,type = "lasso", intercept = FALSE)
predict.lars(lars.model, type="coefficients", mode="lambda")

I appreciate any help here.


Update:

I have used the following code now:

fits.cv = cv.glmnet(x, y, type.measure = "mse", intercept = FALSE, penalty.factor = pnlty)
lmin    = fits.cv$lambda.min    # lambda giving minimum mean cross-validation error
fits    = glmnet(x, y, alpha = 1, intercept = FALSE, penalty.factor = pnlty)
coef    = coef(fits, s = lmin)

In line 1 I use cross-validation with my specified penalty factors ($\lambda _{j} = \frac{\log (T)}{T|b_{j}^{*}|}$), which differ across regressors.
Line 2 extracts lambda.min from fits.cv, which is the lambda that gives the minimum mean cross-validation error.
Line 3 performs a lasso fit (alpha=1) on the data, again with the same penalty factors.
Line 4 extracts the coefficients from fits that belong to the "optimal" $\lambda$ chosen in line 2.

Now I have the beta coefficients for the regressors that solve the minimization problem

$$\min_{b\in \mathbb{R}^{m}} \left\{ \sum_{t=1}^{T}(y_{t}-b^{\top} X_{t} )^{2} + T\sum_{j=1}^{m} \lambda_{j}|b_{j}| \right\}$$

with penalty factors $\lambda _{j} = \frac{\log (T)}{T|b_{j}^{*}|}$. The optimal set of coefficients will most likely use only a subset of the regressors I started with; this is a consequence of the lasso method, which shrinks some coefficients to exactly zero and thereby reduces the number of regressors used.

Is my understanding and the code correct?

Best Answer

From the glmnet documentation (?glmnet), we see that it is possible to perform differential shrinkage. This gets us at least part-way to answering OP's question.

penalty.factor: Separate penalty factors can be applied to each coefficient. This is a number that multiplies lambda to allow differential shrinkage. Can be 0 for some variables, which implies no shrinkage, and that variable is always included in the model. Default is 1 for all variables (and implicitly infinity for variables listed in exclude). Note: the penalty factors are internally rescaled to sum to nvars, and the lambda sequence will reflect this change.

To fully answer the question, though, I think that there are two approaches available to you, depending on what you want to accomplish.

  1. Your question is how to apply differential shrinkage in glmnet and retrieve the coefficients for a specific value of $\lambda$. Supplying penalty.factor values that are not all 1 achieves differential shrinkage at any value of $\lambda$. To achieve shrinkage such that the shrinkage for each $b_j$ is $\phi_j= \frac{\log T}{T|b_j^*|}$, we just have to do some algebra. Let $\phi_j$ be the penalty factor for $b_j$, i.e. what would be supplied to penalty.factor. From the documentation, we can see that these values are rescaled to $\phi^\prime_j = C\phi_j$, where $C$ is chosen so that $m=C\sum_{j=1}^m \frac{\log T}{T|b^*_j|}$. This means that $\phi_j^\prime$ replaces $\phi_j$ in the optimization expression below. So solve for $C$, supply the values $\phi_j^\prime$ to glmnet, and then extract coefficients for $\lambda=1$. I would recommend using coef(model, s=1, exact=T). (A short sketch of this follows the list.)

  2. The second is the "standard" way to use glmnet: one performs repeated $k$-fold cross-validation to select the $\lambda$ that minimizes out-of-sample MSE. This is what I describe below in more detail. The reason we use CV and check out-of-sample MSE is that in-sample MSE is always minimized at $\lambda=0$, i.e. when $b$ is the ordinary MLE. Using CV while varying $\lambda$ allows us to estimate how the model performs on out-of-sample data, and to select a $\lambda$ that is optimal (in a specific sense).
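To make the first approach concrete, here is a rough sketch. It takes the rescaling argument above at face value and reuses the x, y, and betas objects from the question; the names n.obs, phi, C, and phi.prime are only illustrative:

library(glmnet)

n.obs     = length(y)                            # T, the number of observations
phi       = log(n.obs) / (n.obs * abs(betas))    # desired shrinkage phi_j for each b_j
C         = length(phi) / sum(phi)               # rescaling constant, so that sum(C * phi) = nvars
phi.prime = C * phi                              # rescaled penalty factors phi'_j

fit = glmnet(x, y, alpha = 1, intercept = FALSE, penalty.factor = phi.prime)

# coefficients at lambda = 1; recent glmnet versions require the original data and
# fitting arguments to be supplied again when exact = TRUE
coef(fit, s = 1, exact = TRUE, x = x, y = y,
     intercept = FALSE, penalty.factor = phi.prime)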

That glmnet call doesn't specify a $\lambda$ (nor should it, because it computes the entire $\lambda$ trajectory by default for performance reasons). coef(fits,s=something) will return the coefficients for the $\lambda$ value something. But no matter the choice of $\lambda$ you provide, the result will reflect the differential penalty that you applied in the call to fit the model.

The standard way to select an optimal value of $\lambda$ is to use cv.glmnet, rather than glmnet. Cross-validation is used to select the amount of shrinkage which minimizes out-of-sample error, while the specification of penalty.factor will shrink some features more than others, according to your weighting scheme.
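As a sketch of that workflow (reusing the x, y, and pnlty objects from the question; the ten folds and the use of lambda.min rather than lambda.1se are illustrative choices):

library(glmnet)

cv.fit = cv.glmnet(x, y, alpha = 1, intercept = FALSE,
                   type.measure = "mse", penalty.factor = pnlty, nfolds = 10)

# coefficients at the lambda that minimizes mean cross-validated error
coef(cv.fit, s = "lambda.min")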

This procedure optimizes

$$ \min_{b\in \mathbb{R}^{m}} \sum_{t=1}^{T}(y_{t}-b^{\top} X_{t} )^{2} + \lambda \sum_{j=1}^{m} \phi_{j}|b_{j}| $$

where $\phi_j$ is the penalty factor for the $j^{\text{th}}$ feature (what you supply in the penalty.factor argument). (This is slightly different from your optimization expression: here a single scalar $\lambda$ multiplies the per-feature factors $\phi_j$, rather than a separate $\lambda_j$ appearing for each feature.) Note that the $\lambda$ term is the same across all features, so the only way that some features are shrunk more than others is through $\phi_j$. Importantly, $\lambda$ and $\phi$ are not the same; $\lambda$ is a scalar and $\phi$ is a vector! In this expression $\lambda$ is fixed/assumed known; that is, the optimization picks the optimal $b$, not the optimal $\lambda$.

This is basically the motivation of glmnet as I understand it: to use penalized regression to estimate a regression model that is not overly-optimistic about its out-of-sample performance. If this is your goal, perhaps this is the right method for you after all.