There is a recent paper, A Significance Test for the Lasso, whose authors include the inventor of the lasso, that reports results on this problem. Since this is a relatively new area of research, the paper's references cover much of what is currently known.
As for your second question, have you tried $\alpha \in (0,1)$? Often a value in this middle range achieves a good compromise between the two penalties; this is called elastic net regularization. Since you are using cv.glmnet, you will probably want to cross-validate over a grid of $(\lambda, \alpha)$ values.
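In R this would be a loop over $\alpha$ values, each calling cv.glmnet. As an illustration (not your exact workflow), here is the same idea sketched with scikit-learn's ElasticNetCV on made-up data; note the naming swap between the libraries: glmnet's alpha is sklearn's l1_ratio, and glmnet's lambda is sklearn's alpha.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Made-up data: sparse truth, so some l1 penalty should help
rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta + rng.normal(scale=0.5, size=n)

# Cross-validate jointly over the mixing weight (glmnet's alpha,
# sklearn's l1_ratio) and the penalty strength (glmnet's lambda,
# sklearn's alpha). cv=10 matches cv.glmnet's default fold count.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=10).fit(X, y)
print("chosen l1_ratio:", model.l1_ratio_)
print("chosen penalty:", model.alpha_)
```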
1) Ridge regression shrinks perfectly correlated predictors equally. Suppose that your true model is:
$$ Y = X_1 + X_2 + 2X_3 + \epsilon $$
where $X_1$ and $X_2$ are perfectly correlated (say $X_1 = X_2$ exactly) and $X_3$ is uncorrelated with the other two. Then, depending on which variables you include, you would get the following fitted regressions (ignoring noise):
- $X_1$ and $X_2$ in: $Y = X_1 + X_2 + 2X_3$
- Only $X_1$ in: $Y = 2X_1 + 2X_3$
- Only $X_2$ in: $Y = 2X_2 + 2X_3$
So the variable-importance ranking depends heavily on which variables are included in the model. Worse, a reasonable method would probably say that $X_1$ and $X_2$ are equally important and, as a set, equal in importance to $X_3$; there doesn't seem to be a way to recover this from a single ridge regression.
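You can see the equal-split behavior directly from the closed-form ridge solution. A minimal sketch in Python/NumPy (not your R setup; the data and the penalty value are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1.copy()              # perfectly correlated: X1 = X2
x3 = rng.normal(size=n)
y = x1 + x2 + 2 * x3 + rng.normal(scale=0.1, size=n)

def ridge(X, y, lam):
    # closed-form ridge solution: (X'X + lam I)^{-1} X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Both correlated predictors in: ridge splits the weight equally
beta_both = ridge(np.column_stack([x1, x2, x3]), y, lam=1.0)
print(np.round(beta_both, 2))   # roughly [1, 1, 2]

# Only X1 in: its coefficient absorbs X2's share
beta_x1 = ridge(np.column_stack([x1, x3]), y, lam=1.0)
print(np.round(beta_x1, 2))     # roughly [2, 2]
```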
2) You answered this already.
3) One option is bootstrapping: resample your training data with replacement and fit a ridge regression on each bootstrap sample. This gives you a sampling distribution of the coefficients, from which you can derive intervals. It suffers from the same issues as 1), though.
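A minimal sketch of this bootstrap, again in Python/NumPy with made-up data and a closed-form ridge (glmnet itself is not used, and the penalty is fixed rather than cross-validated):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, 0.0, 2.0]) + rng.normal(size=n)

def ridge(X, y, lam=1.0):
    # closed-form ridge: (X'X + lam I)^{-1} X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Refit on B resamples (rows drawn with replacement) to get a
# bootstrap distribution for each coefficient.
B = 1000
boot = np.empty((B, 3))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b] = ridge(X[idx], y[idx])

# 95% percentile intervals, one per coefficient
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
print(np.round(lo, 2), np.round(hi, 2))
```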
Best Answer
The cv.glmnet function uses k-fold cross-validation (10 folds by default) to estimate an optimal penalty. The software fits ridge regressions over a grid of penalty values, uses cross-validation to estimate the out-of-sample prediction error at each grid point, and then chooses the penalty value that minimizes that estimated error.
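What cv.glmnet does internally can be sketched by hand. A toy version in Python/NumPy (10 folds, a log-spaced penalty grid; the data and grid endpoints are made up, and real glmnet uses a more careful grid and fold assignment):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = rng.normal(size=(n, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(size=n)

def ridge(X, y, lam):
    # closed-form ridge: (X'X + lam I)^{-1} X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lambdas = np.logspace(-2, 4, 30)   # penalty grid
k = 10                             # fold count, as in cv.glmnet's default
folds = np.arange(n) % k           # assign each row to a fold
cv_mse = np.zeros(len(lambdas))
for j, lam in enumerate(lambdas):
    for f in range(k):
        train, test = folds != f, folds == f
        beta = ridge(X[train], y[train], lam)
        # accumulate the average held-out prediction error
        cv_mse[j] += np.mean((y[test] - X[test] @ beta) ** 2) / k

best_lam = lambdas[np.argmin(cv_mse)]
print("chosen penalty:", best_lam)
```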