From the glmnet documentation (?glmnet), we see that it is possible to perform differential shrinkage. This gets us at least part-way to answering OP's question.
penalty.factor: Separate penalty factors can be applied to each coefficient. This is a number that multiplies lambda to allow differential shrinkage. Can be 0 for some variables, which implies no shrinkage, and that variable is always included in the model. Default is 1 for all variables (and implicitly infinity for variables listed in exclude). Note: the penalty factors are internally rescaled to sum to nvars, and the lambda sequence will reflect this change.
To fully answer the question, though, I think that there are two approaches available to you, depending on what you want to accomplish.
Your question is how to apply differential shrinkage in glmnet and retrieve the coefficients for a specific value of $\lambda$. Supplying penalty.factor values that are not all 1 achieves differential shrinkage at any value of $\lambda$. To achieve shrinkage such that the shrinkage for each $b_j$ is $\phi_j= \frac{\log T}{T|b_j^*|}$, we just have to do some algebra. Let $\phi_j$ be the penalty factor for $b_j$, i.e. the value supplied to penalty.factor. From the documentation, we can see that these values are internally re-scaled by a constant $C$, giving $\phi^\prime_j = C\phi_j$ with $C$ chosen so that $m=C\sum_{j=1}^m \frac{\log T}{T|b^*_j|}$. This means that $\phi_j^\prime$ replaces $\phi_j$ in the optimization expression below. So solve for $C$, supply the values $\phi_j^\prime$ to glmnet, and then extract the coefficients at $\lambda=1$. I would recommend using coef(model, s=1, exact=TRUE).
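For concreteness, here is a minimal sketch of that recipe in R. The objects X, y, and the pilot estimates b_star are hypothetical stand-ins (not from the original post), and the exact value of s you extract at depends on glmnet's internal scaling, so treat this as a starting point rather than a definitive implementation.

    library(glmnet)

    # Hypothetical inputs: X is a T x m predictor matrix, y the response,
    # b_star a vector of pilot estimates b_j* (e.g. from OLS).
    T_obs <- nrow(X)
    m     <- ncol(X)

    phi       <- log(T_obs) / (T_obs * abs(b_star))  # desired per-coefficient weights phi_j
    C         <- m / sum(phi)                        # rescaling constant so that sum(C * phi) = m
    phi_prime <- C * phi                             # phi'_j, already summing to nvars

    fit <- glmnet(X, y, alpha = 1, penalty.factor = phi_prime)

    # Extract coefficients at lambda = 1, per the recipe above. Note that glmnet's
    # gaussian objective is RSS/(2T) + lambda * penalty, so the s you actually need
    # may require adjusting for that scaling. exact = TRUE re-solves at this lambda
    # rather than interpolating; recent glmnet versions require the original data
    # (and non-default arguments) to be resupplied for that.
    coef(fit, s = 1, exact = TRUE, x = X, y = y, penalty.factor = phi_prime)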
The second is the "standard" way to use glmnet: perform repeated $k$-fold cross-validation to select the $\lambda$ that minimizes out-of-sample MSE. This is what I describe below in more detail. The reason we use CV and check out-of-sample MSE is that in-sample MSE will always be minimized at $\lambda=0$, i.e. when $b$ is the ordinary MLE. Using CV while varying $\lambda$ allows us to estimate how the model performs on out-of-sample data and to select a $\lambda$ that is optimal (in a specific sense).
That glmnet call doesn't specify a $\lambda$ (nor should it, because it computes the entire $\lambda$ trajectory by default for performance reasons). coef(fits, s=something) will return the coefficients for the $\lambda$ value something. But no matter the choice of $\lambda$ you provide, the result will reflect the differential penalty that you applied in the call to fit the model.
The standard way to select an optimal value of $\lambda$ is to use cv.glmnet, rather than glmnet. Cross-validation is used to select the amount of shrinkage which minimizes out-of-sample error, while the specification of penalty.factor will shrink some features more than others, according to your weighting scheme.
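A minimal sketch of that workflow, again with hypothetical X, y, and a vector phi of penalty factors:

    library(glmnet)

    # Hypothetical inputs: X (predictor matrix), y (response),
    # phi (per-feature penalty factors; larger values mean more shrinkage).
    set.seed(1)
    cvfit <- cv.glmnet(X, y, alpha = 1, penalty.factor = phi, nfolds = 10)

    plot(cvfit)                    # cross-validated error across the lambda path
    cvfit$lambda.min               # lambda minimizing CV error
    cvfit$lambda.1se               # largest lambda within one SE of the minimum
    coef(cvfit, s = "lambda.min")  # coefficients at the CV-selected lambda

Extracting at s = "lambda.1se" instead gives a sparser, more conservative fit; either way, the penalty.factor weighting is respected along the whole $\lambda$ path.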
This procedure optimizes
$$
\underset{b\in \mathbb{R}^{m} }{\min} \sum_{t=1}^{T}(y_{t}-b^{\top} X_{t} )^{2} + \lambda \sum_{j=1}^{m} ( \phi_{j}|b_{j}| )
$$
where $\phi_j$ is the penalty factor for the $j^{th}$ feature (what you supply in the penalty.factor argument). (This is slightly different from your optimization expression; note that some of the subscripts are different.) Note that the $\lambda$ term is the same across all features, so the only way some features are shrunk more than others is through $\phi_j$. Importantly, $\lambda$ and $\phi$ are not the same: $\lambda$ is a scalar and $\phi$ is a vector! In this expression, $\lambda$ is fixed/assumed known; that is, the optimization will pick the optimal $b$, not the optimal $\lambda$.
This is basically the motivation of glmnet as I understand it: to use penalized regression to estimate a regression model that is not overly optimistic about its out-of-sample performance. If this is your goal, perhaps this is the right method for you after all.
The tuning of the Elastic Net penalty during cross-validation has resulted in a penalty that shrinks all coefficients to zero.
Without being mathematically exact, this seems to indicate that none of your features is very helpful. In this case Elastic Net will always predict the mean of the data it was trained on.
Your measure of accuracy is very problematic, as just predicting the mean can produce very high scores.
For example, for a standard normal distribution the average absolute error of predicting the mean is close to 0.8 ($\sqrt{2/\pi}\approx 0.80$). With a large sample size the range is easily around 8, giving you an accuracy of about 0.9.
See here:
> set.seed(123)
> x <- rnorm(1e5)
> 1-mean(abs(x-mean(x)))/diff(range(x))
0.9056073
Best Answer
Please think very carefully about why you want confidence intervals for the LASSO coefficients and how you will interpret them. This is not an easy problem.
The predictors chosen by LASSO (as for any feature-selection method) can be highly dependent on the data sample at hand. You can examine this in your own data by repeating your LASSO model-building procedure on multiple bootstrap samples of the data. If you have predictors that are correlated with each other, the specific predictors chosen by LASSO are likely to differ among models based on the different bootstrap samples. So what do you mean by a confidence interval for a coefficient for a predictor, say predictor $x_1$, if $x_1$ wouldn't even have been chosen by LASSO if you had worked with a different sample from the same population?
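A rough sketch of that check in R (hypothetical X and y; the details of your own model-building procedure, here simply cv.glmnet, would go inside the loop):

    library(glmnet)

    # Hypothetical data: X is an n x p predictor matrix, y the response.
    set.seed(1)
    n <- nrow(X)
    B <- 200                                   # number of bootstrap resamples
    selected <- matrix(0, nrow = B, ncol = ncol(X),
                       dimnames = list(NULL, colnames(X)))

    for (b in seq_len(B)) {
      idx   <- sample(n, replace = TRUE)       # bootstrap sample of rows
      cvfit <- cv.glmnet(X[idx, ], y[idx], alpha = 1)
      beta  <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]  # drop intercept
      selected[b, ] <- beta != 0               # was each predictor kept?
    }

    colMeans(selected)  # per-predictor selection frequency across resamples

Predictors with selection frequencies far from 0 or 1 are exactly the ones whose "confidence intervals" would be hardest to interpret.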
The quality of predictions from a LASSO model is typically of more interest than are confidence intervals for the individual coefficients. Despite the instability in feature selection, LASSO-based models can be useful for prediction. The selection of one from among several correlated predictors might be somewhat arbitrary, but the one selected serves as a rough proxy for the others and thus can lead to valid predictions. You can test the performance of your LASSO approach by seeing how well the models based on multiple bootstrapped samples work on the full original data set.
That said, there is recent work on principled ways to obtain confidence intervals and on related issues in inference after LASSO. This page and its links are a good place to start. The issues are discussed in more detail in Section 6.3 of Statistical Learning with Sparsity. There is also a package selectiveInference in R that implements these methods. But these are based on specific assumptions that might not hold in your data. If you do choose to use this approach, make sure to understand the conditions under which the approach is valid and exactly what those confidence intervals really mean. That statistical issue, rather than the R coding issue, is what is crucial here.
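For completeness, the mechanics in R look roughly like this (a sketch following the selectiveInference package documentation; X, y, and the value of lambda are hypothetical, and the statistical caveats above are what really matter):

    library(glmnet)
    library(selectiveInference)

    # Hypothetical data: X (n x p), y; lambda chosen beforehand (e.g. via cross-validation).
    n      <- nrow(X)
    lambda <- 0.1                     # illustrative value only

    # The package expects the lasso solution computed with standardize = FALSE,
    # and lambda on the sum-of-squares scale, hence the division by n below.
    gfit <- glmnet(X, y, standardize = FALSE)
    beta <- as.numeric(coef(gfit, s = lambda / n, exact = TRUE, x = X, y = y))[-1]

    out <- fixedLassoInf(X, y, beta, lambda)
    out  # post-selection p-values and confidence intervals for the selected coefficients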