Solved – Lasso, Ridge and Best Subset estimator for orthogonal cases

Tags: estimators, lasso, regression, regularization, ridge regression

I am reading the book "Elements of Statistical Learning". In the book the authors compare the OLS estimator with Lasso, Ridge and Best Subset for the special case of an $X$ with orthonormal columns. I am attaching the relevant table of estimators. I am able to derive the estimator for Ridge but am finding it hard to solve for Best Subset and Lasso. How exactly do they arrive at the final formulas?

(The attached image is Table 3.4 of ESL, which lists the estimators under orthonormal columns of $X$: best subset of size $M$ keeps $\hat{\beta}_j$ if $|\hat{\beta}_j| \ge |\hat{\beta}_{(M)}|$ and sets it to zero otherwise, ridge gives $\hat{\beta}_j/(1+\lambda)$, and the lasso gives $sign(\hat{\beta}_j)(|\hat{\beta}_j| - \lambda)_+$, where $\hat{\beta}_{(M)}$ is the $M$-th largest coefficient in absolute value.)

Best Answer

From the context, I'm assuming that the $\hat{\beta}_j$'s are the ordinary least squares estimates, and the table is showing how they would be transformed under each of the listed methods.

Best Subset:

Because the columns are orthonormal, the least squares coefficients are simply $\hat{\beta}_j = x_j^Ty$. (Orthogonality implies that they're given by $\hat{\beta}_j = \frac{x_j^Ty}{x_j^Tx_j}$, but since we have orthonormal columns, $x_j^Tx_j = 1$.)
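As a quick sanity check, here is a minimal numpy sketch (my own illustration, not from the book) confirming that with orthonormal columns the full least squares fit coincides with the per-column inner products $x_j^Ty$:

```python
# With orthonormal columns, the full multivariate OLS fit reduces to
# the per-column inner products x_j^T y.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4

# Build an X with orthonormal columns via a QR decomposition.
X = np.linalg.qr(rng.normal(size=(n, p)))[0]
y = rng.normal(size=n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # full least squares fit
beta_inner = X.T @ y                             # per-column x_j^T y

print(np.allclose(beta_ols, beta_inner))  # True
```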

Then by definition of best subset, we're looking for the $M$ predictors that give the smallest residual sum of squares. This is equivalent to finding the $M$ coefficients that are largest in absolute value. This might already be intuitive, but if not, note that the residual sum of squares from regressing $y$ on $x_j$ is given by:

$r_j = (y - x_j\hat{\beta}_j)^T(y - x_j\hat{\beta}_j)$

$= y^Ty - 2\hat{\beta}_jx_j^Ty + \hat{\beta}_j^2$ (using $x_j^Tx_j = 1$)

$= y^Ty - 2(x_j^Ty)^2 + (x_j^Ty)^2$ (applying the solution $\hat{\beta}_j = x_j^Ty$)

$= y^Ty - (x_j^Ty)^2$

$= y^Ty - |\hat{\beta}_j|^2,$

which is clearly minimized by making $|\hat{\beta}_j|$ as large as possible.

It follows that the best subset solution with $M$ predictors is to regress $y$ on each $x_j$, order the coefficients by absolute value, and keep the $M$ largest, which is exactly the solution given in the table.
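If you want to verify this numerically, here is a small sketch (again my own illustration, not the book's code) that checks the thresholding rule against brute-force best subset selection on a toy problem with orthonormal columns:

```python
# For orthonormal X, keeping the M coefficients largest in absolute value
# matches brute-force best subset selection over all subsets of size M.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, p, M = 60, 5, 2

X = np.linalg.qr(rng.normal(size=(n, p)))[0]   # orthonormal columns
beta_true = np.array([3.0, 0.0, -2.0, 0.5, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = X.T @ y  # least squares coefficients in the orthonormal case

# Closed form: zero out all but the M largest |beta_hat_j|.
keep = np.argsort(-np.abs(beta_hat))[:M]
beta_closed = np.zeros(p)
beta_closed[keep] = beta_hat[keep]

# Brute force: refit OLS on every subset of size M and keep the smallest RSS.
best_rss, best_beta = np.inf, None
for subset in combinations(range(p), M):
    cols = list(subset)
    b = np.zeros(p)
    b[cols] = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
    rss = np.sum((y - X @ b) ** 2)
    if rss < best_rss:
        best_rss, best_beta = rss, b

print(np.allclose(beta_closed, best_beta))  # True
```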

Lasso:

The lasso coefficient for regressing $y$ on $x_j$ is the $\hat{\beta}$ that minimizes $\frac{1}{2}(y - x_j\hat{\beta})^T(y - x_j\hat{\beta}) + \lambda|\hat{\beta}|$. Now assume that $\hat{\beta} \neq 0$. Taking the derivative of that expression with respect to $\hat{\beta}$ and setting it equal to $0$ gives

$-x_j^T(y - x_j\hat{\beta}) + sign(\hat{\beta})\lambda = 0$, where we need the sign operator because the derivative of $|\hat{\beta}|$ is $1$ if $\hat{\beta} > 0$ and $-1$ otherwise.

Simplifying the expression above gives

$-x_j^Ty + x_j^Tx_j\hat{\beta} + sign(\hat{\beta})\lambda = 0$

$\implies \hat{\beta} = x_j^Ty - sign(\hat{\beta})\lambda$ (where we used the fact that $x_j^Tx_j = 1$, since the columns are orthonormal).

$\implies \hat{\beta} = \hat{\beta_j} - sign(\hat{\beta})\lambda$ (recall the definition of $\hat{\beta_j}$, the least squares solution).

Now we consider cases for the sign of $\hat{\beta}$:

  1. If $sign(\hat{\beta}) > 0$, then we must have $\hat{\beta_j} - \lambda > 0$, which means $\hat{\beta_j} > \lambda$ (and therefore $\hat{\beta_j} > 0$).
  • In this case the lasso estimate is $\hat{\beta} = \hat{\beta_j} - \lambda = |\hat{\beta_j}| - \lambda = sign(\hat{\beta_j})(|\hat{\beta_j}| - \lambda)$.
  2. If $sign(\hat{\beta}) < 0$, then we must have $\hat{\beta_j} + \lambda < 0$, which means $-\hat{\beta_j} > \lambda$ (and therefore $\hat{\beta_j} < 0$).
  • In this case the lasso estimate is $\hat{\beta} = \hat{\beta_j} + \lambda = -|\hat{\beta_j}| + \lambda = -(|\hat{\beta_j}| - \lambda) = sign(\hat{\beta_j})(|\hat{\beta_j}| - \lambda)$.

In both cases, we required that $|\hat{\beta_j}| > \lambda$. If instead $|\hat{\beta_j}| \le \lambda$, then our initial assumption that $\hat{\beta} \neq 0$ must have been wrong, and the minimizer is $\hat{\beta} = 0$. Taking only the positive part, $(|\hat{\beta_j}| - \lambda)_+$, captures both situations in a single formula.

Therefore, you get the solution in the table.
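As a final check, here is a short numerical sketch (my own, not from ESL) confirming that the soft-thresholding formula minimizes the one-predictor lasso objective:

```python
# The soft-thresholding formula sign(beta_j) * max(|beta_j| - lambda, 0)
# minimizes 0.5 * ||y - x_j * b||^2 + lambda * |b| when x_j^T x_j = 1.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
x /= np.linalg.norm(x)          # normalize so x^T x = 1
y = 1.5 * x + rng.normal(size=n)
lam = 0.4

beta_ls = x @ y                                              # least squares estimate
beta_soft = np.sign(beta_ls) * max(abs(beta_ls) - lam, 0.0)  # table's lasso formula

# Brute-force minimization of the lasso objective over a fine grid.
grid = np.linspace(-5, 5, 20001)
objective = 0.5 * np.sum((y[:, None] - np.outer(x, grid)) ** 2, axis=0) + lam * np.abs(grid)
beta_grid = grid[np.argmin(objective)]

print(beta_soft, beta_grid)     # agree up to the grid resolution
```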