Solved – glmnet in R: Selecting the right $\alpha$

glmnet

I was reading the following link. One of the sections discusses selecting a value for $\alpha$.

Looking at the bottom-right plot below (which contains MSE curves for 3 different values of $\alpha$), it seems to me that $\alpha = 0$ gives the lowest MSE and hence should be the best here. However, the image below says that $\alpha=1$ is the best instead.

Question: I do not understand why $\alpha=1$ is the best, because it seems to me that $\alpha=0$ yields the lowest MSE for all values of $\log(\lambda)$.

My interpretation is that the higher the value of $\lambda$, the more penalization there is. And if a model yields a lower MSE than the other models at the same penalty factor (i.e., the same $\lambda$), then surely it should be the best model. Am I wrong?

What did I misunderstand here?

Image 1: cross-validation MSE curves for the three values of $\alpha$.

Best Answer

If low MSE is your goal, go with $\alpha=0$ and a small value of $\lambda$ (s = lambda.1se, s = lambda.min, or even something smaller). If your goal is a simpler model (with fewer than 20 variables), then you could tune $\lambda$ using the cross-validation plots along with your preference for model complexity.
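For reference, the two named choices have simple definitions: lambda.min is the $\lambda$ with the smallest mean cross-validated error, and lambda.1se is the largest $\lambda$ whose mean error is within one standard error of that minimum. A minimal sketch of the rules in plain Python (not glmnet's code; the CV numbers here are invented for illustration):

```python
# Sketch of glmnet's lambda.min / lambda.1se selection rules.
# The lambda grid and CV errors below are made up for illustration.
lambdas = [1.0, 0.5, 0.1, 0.05, 0.01]   # decreasing penalty strength
cv_mean = [3.0, 2.2, 1.5, 1.4, 1.45]    # mean CV MSE at each lambda
cv_se   = [0.3, 0.25, 0.2, 0.2, 0.2]    # standard error of each mean

def select_lambdas(lambdas, cv_mean, cv_se):
    # lambda.min: smallest mean CV error
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    lambda_min = lambdas[i_min]
    # lambda.1se: largest lambda within one SE of the minimum error
    threshold = cv_mean[i_min] + cv_se[i_min]
    lambda_1se = max(l for l, m in zip(lambdas, cv_mean) if m <= threshold)
    return lambda_min, lambda_1se

lambda_min, lambda_1se = select_lambdas(lambdas, cv_mean, cv_se)
print(lambda_min, lambda_1se)  # -> 0.05 0.1
```

lambda.1se trades a little CV error for a sparser, more heavily penalized model, which is why it is the conventional choice when you want a simpler fit.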

I'm guessing you have enough data relative to your model that regularization is not especially beneficial. In all plots above, the cross-validation results are telling the same story: "the smaller the lambda the better." If you extrapolate that curve out, then you don't have any regularization at all and you're back to ordinary regression with 20 variables. If I had to guess, the full 20 variable model really is the best for your situation, which is why $\alpha = 0$ with a very small $\lambda$ is giving you the best MSE results - it keeps all the variables and applies very little regularization (i.e., bias).
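The extrapolation argument can be made concrete in the one-predictor case, where ridge regression has a closed form: $\hat\beta_{\text{ridge}} = \sum x_i y_i / (\sum x_i^2 + \lambda)$. As $\lambda \to 0$ this converges to the OLS slope, so a CV curve that keeps improving as $\lambda$ shrinks is pointing back toward the unregularized fit. A toy check in plain Python (data invented):

```python
# One-predictor ridge slope: beta = sum(x*y) / (sum(x^2) + lam).
# As lam -> 0 it converges to the OLS slope sum(x*y) / sum(x^2).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x; numbers invented

sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

def ridge_slope(lam):
    return sxy / (sxx + lam)

beta_ols = sxy / sxx
for lam in (10.0, 1.0, 0.1, 0.0):
    print(lam, ridge_slope(lam))
# Larger lam shrinks the slope toward zero (more bias);
# lam = 0.0 reproduces the OLS slope exactly.
```

This is the sense in which a very small $\lambda$ "keeps all the variables and applies very little regularization."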

For reasons I don't fully understand, LASSO ($\alpha = 1$) stops short of 20 variables (even for s = lambda.min), though the curve appears to still be decreasing. Perhaps the defaults are set so that variable selection actually happens, since that is presumably what a user choosing $\alpha=1$ wants.
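Why LASSO drops variables at all (and ridge never does) comes down to the coordinate-wise update each penalty implies: the $\ell_1$ penalty soft-thresholds a coefficient, setting it exactly to zero once its unpenalized value falls below $\lambda$, while the $\ell_2$ penalty only rescales it. A minimal scalar sketch (glmnet's actual coordinate descent is more involved):

```python
def soft_threshold(z, lam):
    """L1 (lasso) proximal update: exactly zero whenever |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def ridge_shrink(z, lam):
    """L2 (ridge) update: rescales z but never reaches exactly zero."""
    return z / (1.0 + lam)

for z in (2.0, 0.3, -0.1):
    print(soft_threshold(z, 0.5), ridge_shrink(z, 0.5))
# soft_threshold sends 0.3 and -0.1 to exactly 0.0 (variable dropped);
# ridge_shrink leaves every nonzero coefficient nonzero.
```

So at any $\lambda$ large enough to matter, $\alpha=1$ will have zeroed out some coefficients, which is exactly the variable selection the user is asking for by choosing LASSO.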