Solved – R – Lasso Regression – different Lambda per regressor

Tags: glmnet, lars, r, regression

I want to do the following:

1) OLS regression (no penalization term) to get beta coefficients $b_{j}^{*}$, where $j$ indexes the regressors. I do this with

lm.model = lm(y~ 0 + x)
betas    = coefficients(lm.model)

2) Lasso regression with a penalization term; the selection criterion shall be the Bayesian Information Criterion (BIC), given by

$$\lambda _{j} = \frac{\log (T)}{T|b_{j}^{*}|}$$

where $j$ stands for the variable/regressor number, $T$ for the number of observations, and $b_{j}^{*}$ for the initial betas obtained in step 1). I want to have regression results for this specific $\lambda_j$ value, which is different for each regressor used. Hence if there are three variables, there will be three different values $\lambda_j$.
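For example (a minimal sketch; the names T.obs and lambdas are purely illustrative), these values can be computed in R from the betas of step 1):

T.obs   = length(y)                          # T, the number of observations
lambdas = log(T.obs) / (T.obs * abs(betas))  # one lambda per regressor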

The OLS-Lasso optimization problem is then given by

$$\min_{b\in \mathbb{R}^{m}} \left\{ \sum_{t=1}^{T}(y_{t}-b^{\top} X_{t} )^{2} + T\sum_{j=1}^{m} \lambda_{j}|b_{j}| \right\}$$

How can I do this in R with either the lars or glmnet package? I cannot find a way to specify lambda and I am not 100% sure if I get the correct results if I run

lars.model <- lars(x,y,type = "lasso", intercept = FALSE)
predict.lars(lars.model, type="coefficients", mode="lambda")

I appreciate any help here.


Update:

I have used the following code now:

fits.cv = cv.glmnet(x, y, type.measure = "mse", intercept = FALSE, penalty.factor = pnlty)
lmin    = fits.cv$lambda.min    # lambda giving minimum mean cross-validation error
fits    = glmnet(x, y, alpha = 1, intercept = FALSE, penalty.factor = pnlty)
coef    = coef(fits, s = lmin)

In line 1 I use cross-validation with my specified penalty factors ($\lambda _{j} = \frac{\log (T)}{T|b_{j}^{*}|}$), which differ across regressors.
Line 2 extracts lambda.min from fits.cv, which is the lambda that gives the minimum mean cross-validation error.
Line 3 performs a lasso fit (alpha=1) on the data, again with the same penalty factors.
Line 4 extracts the coefficients from fits that belong to the "optimal" $\lambda$ chosen in line 2.

Now I have the beta coefficients for the regressors that solve the minimization problem

$$\min_{b\in \mathbb{R}^{m}} \left\{ \sum_{t=1}^{T}(y_{t}-b^{\top} X_{t} )^{2} + T\sum_{j=1}^{m} \lambda_{j}|b_{j}| \right\}$$

with penalty factors $\lambda _{j} = \frac{\log (T)}{T|b_{j}^{*}|}$. The optimal set of coefficients will most likely use only a subset of the regressors I started with; this is a consequence of the lasso method, which shrinks some coefficients to exactly zero and thereby reduces the number of regressors used.

Is my understanding and the code correct?

Best Answer

From the glmnet documentation (?glmnet), we see that it is possible to perform differential shrinkage. This gets us at least part-way to answering OP's question.

penalty.factor: Separate penalty factors can be applied to each coefficient. This is a number that multiplies lambda to allow differential shrinkage. Can be 0 for some variables, which implies no shrinkage, and that variable is always included in the model. Default is 1 for all variables (and implicitly infinity for variables listed in exclude). Note: the penalty factors are internally rescaled to sum to nvars, and the lambda sequence will reflect this change.

To fully answer the question, though, I think that there are two approaches available to you, depending on what you want to accomplish.

  1. Your question is how to apply differential shrinkage in glmnet and retrieve the coefficients for a specific value of $\lambda$. Supplying penalty.factor values that are not all 1 achieves differential shrinkage at any value of $\lambda$. To achieve shrinkage such that the shrinkage for each $b_j$ is $\phi_j= \frac{\log T}{T|b_j^*|}$, we just have to do some algebra. Let $\phi_j$ be the penalty factor for $b_j$, i.e. what would be supplied to penalty.factor. From the documentation, we can see that these values are rescaled to $\phi^\prime_j = C\phi_j$, where $C$ is chosen so that $m=C\sum_{j=1}^m \frac{\log T}{T|b^*_j|}$. This means that $\phi_j^\prime$ replaces $\phi_j$ in the optimization expression below. So solve for $C$, supply the values $\phi_j^\prime$ to glmnet, and then extract coefficients for $\lambda=1$. I would recommend using coef(model, s=1, exact=T). (A short sketch of this follows the list.)

  2. The second is the "standard" way to use glmnet: one performs repeated $k$-fold cross-validation to select the $\lambda$ that minimizes out-of-sample MSE. This is what I describe below in more detail. The reason we use CV and check out-of-sample MSE is that in-sample MSE is always minimized at $\lambda=0$, i.e. when $b$ is the ordinary MLE. Using CV while varying $\lambda$ allows us to estimate how the model performs on out-of-sample data, and to select a $\lambda$ that is optimal (in a specific sense).
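To make the first approach concrete, here is a rough sketch. It takes the rescaling argument above at face value and reuses the x, y, and betas objects from the question; the names n.obs, phi, C, and phi.prime are only illustrative:

library(glmnet)

n.obs     = length(y)                            # T, the number of observations
phi       = log(n.obs) / (n.obs * abs(betas))    # desired shrinkage phi_j for each b_j
C         = length(phi) / sum(phi)               # rescaling constant, so that sum(C * phi) = nvars
phi.prime = C * phi                              # rescaled penalty factors phi'_j

fit = glmnet(x, y, alpha = 1, intercept = FALSE, penalty.factor = phi.prime)

# coefficients at lambda = 1; recent glmnet versions require the original data and
# fitting arguments to be supplied again when exact = TRUE
coef(fit, s = 1, exact = TRUE, x = x, y = y,
     intercept = FALSE, penalty.factor = phi.prime)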

That glmnet call doesn't specify a $\lambda$ (nor should it, because it computes the entire $\lambda$ trajectory by default for performance reasons). coef(fits,s=something) will return the coefficients for the $\lambda$ value something. But no matter the choice of $\lambda$ you provide, the result will reflect the differential penalty that you applied in the call to fit the model.

The standard way to select an optimal value of $\lambda$ is to use cv.glmnet, rather than glmnet. Cross-validation is used to select the amount of shrinkage which minimizes out-of-sample error, while the specification of penalty.factor will shrink some features more than others, according to your weighting scheme.
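As a sketch of that workflow (reusing the x, y, and pnlty objects from the question; the ten folds and the use of lambda.min rather than lambda.1se are illustrative choices):

library(glmnet)

cv.fit = cv.glmnet(x, y, alpha = 1, intercept = FALSE,
                   type.measure = "mse", penalty.factor = pnlty, nfolds = 10)

# coefficients at the lambda that minimizes mean cross-validated error
coef(cv.fit, s = "lambda.min")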

This procedure optimizes

$$ \min_{b\in \mathbb{R}^{m}} \sum_{t=1}^{T}(y_{t}-b^{\top} X_{t} )^{2} + \lambda \sum_{j=1}^{m} \phi_{j}|b_{j}| $$

where $\phi_j$ is the penalty factor for the $j^{\text{th}}$ feature (what you supply in the penalty.factor argument). (This is slightly different from your optimization expression: here a single scalar $\lambda$ multiplies the per-feature factors $\phi_j$, rather than a separate $\lambda_j$ appearing for each feature.) Note that the $\lambda$ term is the same across all features, so the only way that some features are shrunk more than others is through $\phi_j$. Importantly, $\lambda$ and $\phi$ are not the same; $\lambda$ is a scalar and $\phi$ is a vector! In this expression $\lambda$ is fixed/assumed known; that is, the optimization picks the optimal $b$, not the optimal $\lambda$.

This is basically the motivation of glmnet as I understand it: to use penalized regression to estimate a regression model that is not overly-optimistic about its out-of-sample performance. If this is your goal, perhaps this is the right method for you after all.