Solved – Things I am not sure about in the LASSO regression method

high-dimensional, large data, lasso, least squares, machine learning

I have read the chapters that are related to "LASSO" regression in:

  1. The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman)
  2. Statistical Learning with Sparsity: The Lasso and Generalizations (Hastie, Tibshirani, and Wainwright)

My questions:

  1. I do not understand why centering $y_i$ makes $\bar{y}$ (the mean of the $y_i$, that is, $\frac{1}{n}\sum_i y_i$) equal to $0$, and why this allows them to omit the intercept term $\beta_0$ from the optimization problem.

  2. They also mentioned that the lasso solution in general cannot be derived. However, they then obtain the solution ($\hat{\beta}_\text{lasso}$) by minimizing $\sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \lambda \|\beta\|_1$. How can these two statements be reconciled?

Could you please show me mathematically, but in an easy way, so that I can understand? I am a beginner in machine learning.

Thanks in advance.

Best Answer

For the first question, recall that in centering we replace each value $y_i$ with $y_i - \bar y$, where $\bar y$ is the mean of the $y$ vector. Then

$$ \sum_i (y_i - \bar y) = \sum_i y_i - n \bar y = \sum_i y_i - \sum_i y_i = 0 $$

I would like to make sure: is the reason that $\bar y$ equals zero that $\bar y = \frac{1}{n} \sum_i (y_i - \bar y)$?

Well, no. First, $\bar y = 0$ is only true if $y$ is the centered response, like this

$$ \overline{y - \bar y} = 0 $$

that is, the mean of the centered response is zero. This relies on two facts:

  • The mean of $y$ is defined as $\bar y = \frac{1}{n} \sum_i y_i$.
  • The mean of $y$ is a constant; it does not depend on $i$.

These two facts together mean that $\sum_i \bar y = n \bar y$, as we have only added a constant to itself $n$ times. Using the definition of the mean, $n \bar y = \sum_i y_i$. All of this together gives the cancellation I discussed above.
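If it helps to see the cancellation numerically, here is a tiny sketch (the numbers are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=10)   # arbitrary response values

y_centered = y - y.mean()                     # replace each y_i with y_i - ybar

print(y.mean())            # some nonzero number (around 5 here)
print(y_centered.sum())    # 0 up to floating-point rounding
print(y_centered.mean())   # hence the mean of the centered response is 0
```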

Now, if $y$ and the predictors are centered, then the intercept in the LASSO fit is always zero. To see why, recall that LASSO is attempting to minimize

$$ L(\beta) = \sum_i (y_i - \sum_j \beta_j x_{ij})^2 + \lambda \sum_{j > 0} \left| \beta_j \right| $$

The intercept parameter does not appear in the penalty term, which is very important here. Since the penalty term is the only place absolute values appear, the loss function is differentiable with respect to the intercept parameter. Therefore this partial derivative must be zero at any extremum (maximum or minimum) of the loss function. So let's compute this partial derivative

$$ \frac{\partial L}{\partial \beta_0} = -2 \sum_i(y_i - \sum_j \beta_j x_{ij}) x_{i0} $$

Now $x_{i0} = 1$ for all $i$, because this is the intercept column of $X$. Furthermore, $\sum_i x_{ij} = 0$ for all other $j$, because we always center all of the other (non-intercept) predictors. With these two relations we can simplify the above to

$$ \frac{\partial L}{\partial \beta_0} = -2 \sum_i y_i + 2 \sum_{j > 0} \beta_j \sum_i x_{ij} + 2 n \beta_0 = 0 + 0 + 2 n \beta_0 $$

Setting this to zero gives

$$ \beta_0 = 0 $$

Notice that this happens regardless of $\lambda$.
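You can also check the algebra above numerically: with a centered response and centered predictors, the partial derivative collapses to $2 n \beta_0$ for any coefficient values, not just the optimal ones. A small numpy sketch with random made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3

X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)            # center every (non-intercept) predictor
y = rng.normal(size=n)
y = y - y.mean()                  # center the response

beta0 = 1.7                       # arbitrary intercept
beta = rng.normal(size=p)         # arbitrary slope coefficients

# partial derivative of the sum-of-squares term with respect to beta_0
residuals = y - (beta0 + X @ beta)
grad_beta0 = -2 * residuals.sum()

print(grad_beta0, 2 * n * beta0)  # the two numbers agree up to rounding
```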

Could you please show me where the $-2 \beta_0$ comes from?

Well, I think I have broken the general case down as far as I am capable of, so maybe simplifying the problem itself a bit will illuminate what's going on. Consider the one-variable case of the lasso: we have one predictor $x$ and two coefficients to estimate, the intercept $\beta_0$ and the coefficient of $x$, namely $\beta_1$.

The loss function in this simplified case is easier to write out in explicit detail

$$ L = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2 + \lambda | \beta_1 | $$

The important feature here is that $\beta_0$ only appears in the sum of squares term, not in the penalty. So $L$ is a differentiable function with respect to the variable $\beta_0$, but is not differentiable with respect to the variable $\beta_1$. So, at a minimum, we can be guaranteed that the partial derivative with respect to $\beta_0$ is zero, but we can make no such claim about $\beta_1$.

Actually computing this partial derivative is pretty simple using the standard rules of differential calculus: the derivative of the penalty term is zero because $\beta_0$ does not appear in it, and then it is one application of the chain rule to take the derivative of the sum-of-squares term

$$ \frac{\partial L}{\partial \beta_0} = -2 \sum_i (y_i - \beta_0 - \beta_1 x_i) $$

Using the fact that the response and the predictor are centered, this expression reduces to

$$ 2 n \beta_0 $$

which must be zero at the minimum, so once again $\beta_0 = 0$.
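In case the reduction is not obvious, it uses nothing but the two centering facts $\sum_i y_i = 0$ and $\sum_i x_i = 0$:

$$ -2 \sum_i (y_i - \beta_0 - \beta_1 x_i) = -2 \sum_i y_i + 2 \sum_i \beta_0 + 2 \beta_1 \sum_i x_i = 0 + 2 n \beta_0 + 0 = 2 n \beta_0 $$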

How can you take the derivative with respect to $\beta_0$ when it does not appear in the function you are taking the derivative of?

It is in there. There are two terms in the loss function being minimized: the residual sum of squares and the penalty. It is true that $\beta_0$ does not appear in the penalty term, but it does appear in the residual sum of squares. Notice that the sum in the residual sum of squares is over all $j$, while in the penalty term it is only over those $j > 0$.

For the second question, the point the books are trying to make is that there is no closed-form solution for the LASSO coefficients. That is, there is no explicit, purely algebraic procedure for determining the coefficients. By contrast, there is an algebraic procedure for solving linear regression: you just solve the equation

$$ X^t X \beta = X^t y $$

Even if there is no algebraic solution, that does not mean that there is no algorithmic solution. This is why iterative methods are used to solve the LASSO problem.
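To make "iterative methods" a bit more concrete, here is a minimal coordinate-descent sketch for the objective written above, $\sum_i (y_i - \sum_j \beta_j x_{ij})^2 + \lambda \sum_j |\beta_j|$. It is only an illustration, not the books' algorithm or a production solver; it assumes the response and predictors are already centered (so no intercept), and the names `soft_threshold` and `lasso_coordinate_descent` are just my own.

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0): the one-dimensional lasso solution."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize sum_i (y_i - x_i^T beta)^2 + lam * sum_j |beta_j| by cycling
    through the coordinates.  Assumes y and the columns of X are centered,
    so no intercept is needed (as derived above)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)              # sum_i x_ij^2 for each column j
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual: remove column j's current contribution
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, lam / 2.0) / col_sq[j]
    return beta

# made-up example: two nonzero true coefficients, three true zeros
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                         # center the predictors
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + rng.normal(size=100)
y = y - y.mean()                               # center the response

print(lasso_coordinate_descent(X, y, lam=50.0))
# for large enough lam, the weak coefficients are typically shrunk to exactly zero
```

Each sweep updates one coordinate at a time using the soft-thresholding operator; real software uses much more refined versions of this same idea.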

In addition, can I find $\hat \beta_\text{lasso}$, "the coefficients", which are equal to $\sum_i (y_i - \beta x_i)^2 + \lambda \sum_j |\beta_j|$, mathematically?

Be careful: the estimated coefficients are equal to the minimizer of the expression, not to the expression itself.

As for the question, it depends on what you mean by "mathematically". You cannot find them exactly by mathematical manipulation, moving symbols around and taking square roots and such. You can find them (to any desired accuracy) using iterative optimization methods, which is what is generally done in software packages. Also, somewhat miraculously, you can find them exactly using some very clever geometric arguments, which is explained in the LARS paper. Unfortunately, this is only possible in very nice cases, and it fails for more general models, like logistic or Poisson regressions.
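If you want to experiment with the LARS-based approach before reading the paper, scikit-learn ships a LARS solver for the lasso. A minimal sketch, assuming scikit-learn is installed; note that its regularization parameter `alpha` is scaled differently from the $\lambda$ used above, so the numbers are not directly comparable:

```python
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + rng.normal(size=100)

# LassoLars follows the LARS path; the default fit_intercept=True handles the
# intercept for you, which internally amounts to the centering discussed above.
model = LassoLars(alpha=0.1).fit(X, y)
print(model.coef_, model.intercept_)
```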