Lasso Regression – Understanding the Mathematical Intuition

lasso, machine-learning, mathematical-statistics, python, regularization

I do not fully understand lasso regression. I am able to code a LASSO regression model of the form $y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$ with sklearn, but I do not understand the mathematical intuition behind the model, which I believe is important. In ISLR (Introduction to Statistical Learning in R), it is explained that the equation for the model coefficients in lasso regression is
$$\hat{\beta} = \text{argmin}_{\beta \in \mathbb{R}^{p+1}}\left(RSS + \lambda \sum_{i=1}^p|\beta_i|\right)$$ where $RSS = \sum_{j = 1}^n(y_j - \hat{y}_j)^2$ is the residual sum of squares.
So, although I do not understand where exactly the hyperparameter $\lambda$ comes from or what its actual purpose is, I do understand that the algorithm seeks to minimise that equation and reduce overfitting by introducing a little bias into the training fit.
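For context, this is roughly the kind of sklearn code I mean (synthetic data purely for illustration; `alpha` is sklearn's name for the penalty weight, which plays the role of $\lambda$ up to a scaling of the RSS term):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data with 5 predictors, two of which are irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -3.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

model = Lasso(alpha=0.1)                 # alpha is the penalty weight
model.fit(X, y)
print(model.intercept_, model.coef_)     # some coefficients may be exactly zero
```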

However, it is shown later that the lasso regression equation can be rewritten as
$$\hat{\beta} = \text{argmin}_{\beta \in \mathbb{R}^{p+1}} (RSS) \text{ subject to } \sum_{i = 1}^p |\beta_i| \leq s$$

which is where my confusion lies. First, how can the original lasso regression equation be rewritten in that form? Also, what is this $s$ term? I gather it has something to do with a sum, but I do not understand which sum this equation refers to or why the sum of the absolute beta values has to be less than or equal to it.

Best Answer

You have three different questions. Let's tackle them in a somewhat different order. I'll refer to ISLR 2nd edition.

First off, here are the two different formulations of the lasso:

$$ \text{minimize } ||\mathbf{y}-\mathbf{X}\boldsymbol{\beta}||_2^2+\lambda||\boldsymbol{\beta}||_1\text{ with respect to }\boldsymbol{\beta}\text{ for a given }\lambda\geq 0 \quad (6.7)$$ and $$ \text{minimize } ||\mathbf{y}-\mathbf{X}\boldsymbol{\beta}||_2^2\text{ with respect to }\boldsymbol{\beta}\text{ subject to }||\boldsymbol{\beta}||_1\leq s\text{ for a given }s\geq0\quad (6.8) $$

The squared 2-norm $||\cdot||_2^2$ of a vector is the sum of its squared entries (for the residual vector, this is exactly the RSS from your question), and the 1-norm $||\cdot||_1$ is the sum of the absolute values of its entries.
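As a quick sanity check of these two definitions (just an illustration, not from ISLR), here is a tiny NumPy snippet with an arbitrary vector:

```python
import numpy as np

v = np.array([3.0, -4.0, 12.0])

# Squared 2-norm: sum of squared entries (for a residual vector, this is the RSS).
sq_two_norm = np.sum(v**2)                 # 9 + 16 + 144 = 169
# 1-norm: sum of absolute entries (this is what the lasso penalizes).
one_norm = np.sum(np.abs(v))               # 3 + 4 + 12 = 19

print(sq_two_norm, np.linalg.norm(v)**2)   # both 169 (up to floating point)
print(one_norm, np.linalg.norm(v, ord=1))  # both 19
```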

On to your questions:

  1. How do $\lambda$ and $s$ hang together?

    There is a correspondence between the two, in the following sense: for every $\lambda$, there is an $s$ such that the minimizer $\boldsymbol{\beta}$ of (6.7) for that $\lambda$ is also the minimizer of (6.8) for that $s$. And vice versa. (Strictly speaking, the correspondence is not one-to-one; see the boundary case below.)

    One interesting boundary case: $\lambda=0$ (no lasso penalty), which leads to the OLS estimate of $\boldsymbol{\beta}$ (assuming your design matrix $\mathbf{X}$ has full rank), corresponds to $s=\infty$ (no constraint on the parameter estimates). If you are uncomfortable with $s=\infty$, just take $s$ to be any number larger than the 1-norm of the OLS estimate of $\boldsymbol{\beta}$. (A small numerical sketch of the correspondence between $\lambda$ and $s$ is given after this list.)

  2. Why are the two formulations equivalent?

    This derivation is usually not given in standard statistics/data science textbooks; most people just accept the fact. I do not think it is formally proven even in the original lasso paper by Tibshirani (1996), but it does not seem to be very hard. Starting with (6.7): pick a $\lambda$, find the optimal $\boldsymbol{\beta}$ in (6.7), use the 1-norm of this estimate as $s$ in (6.8), and argue that no $\boldsymbol{\beta}$ with 1-norm at most $s$ can achieve a smaller RSS. And vice versa. A rigorous proof would be a nice exercise for an undergraduate math student.

  3. How do we choose $\lambda$ (or equivalently $s$)?

    The algorithm usually used to fit a lasso model will give you an entire sequence of parameter coefficient vectors, one for each of many values of $\lambda$ (e.g., Figure 6.6 in ISLR). You can then choose one of these. This is often done via cross-validation using some accuracy measure; in fact, your software may already perform this cross-validation internally and just report the optimal $\lambda$ (see the second sketch after this list).
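To make the correspondence between $\lambda$ and $s$ from point 1 concrete, here is a small numerical sketch (not from ISLR, just an illustration with synthetic data). Note that scikit-learn's `Lasso` minimizes $\frac{1}{2n}||\mathbf{y}-\mathbf{X}\boldsymbol{\beta}||_2^2 + \alpha||\boldsymbol{\beta}||_1$, so its `alpha` corresponds to $\lambda$ only up to that $1/(2n)$ scaling of the RSS term:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 5 predictors, two of which have zero true coefficients.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -3.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# 1-norm of the unpenalized OLS coefficients (intercept excluded, as in the penalty).
ols = LinearRegression().fit(X, y)
print("OLS 1-norm:", np.abs(ols.coef_).sum())

# For each penalty weight, read off s as the 1-norm of the fitted lasso coefficients:
# larger penalties give smaller s, and a vanishing penalty approaches the OLS fit.
for alpha in [1.0, 0.1, 0.01, 0.001]:
    lasso = Lasso(alpha=alpha, max_iter=50_000).fit(X, y)
    print(f"alpha={alpha:>6}: s = {np.abs(lasso.coef_).sum():.3f}")
```

The printed $s$ values shrink as the penalty grows, and for the smallest penalty the 1-norm is close to the OLS 1-norm, which is the boundary case from point 1.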
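And here is a sketch of the cross-validation step from point 3, using scikit-learn's `LassoCV`, which fits the lasso along a whole grid of penalty values and reports the one with the best cross-validated error (again synthetic data, purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Same kind of synthetic data as above.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -3.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# LassoCV fits the model along an automatically chosen grid of penalties and
# selects the one with the lowest cross-validated mean squared error.
cv_model = LassoCV(cv=5).fit(X, y)
print("chosen penalty:", cv_model.alpha_)
print("coefficients: ", cv_model.coef_)
```

This is exactly the kind of internal cross-validation mentioned above: you get back a single fitted model together with the penalty value it settled on.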