Solved – Intuition for the degrees of freedom of the LASSO

degrees-of-freedom, lasso, regression, regularization

Zou et al., "On the 'degrees of freedom' of the lasso" (2007), show that the number of nonzero coefficients is an unbiased and consistent estimate of the degrees of freedom of the lasso.

It seems a little counterintuitive to me.

  • Suppose we have a regression model (where the variables are zero mean)

$$y=\beta x + \varepsilon.$$

  • Suppose an unrestricted OLS estimate of $\beta$ is $\hat\beta_{OLS}=0.5$. It could roughly coincide with a LASSO estimate of $\beta$ for a very low penalty intensity.
  • Suppose further that a LASSO estimate for a particular penalty intensity $\lambda^*$ is $\hat\beta_{LASSO,\lambda^*}=0.4$. For example, $\lambda^*$ could be the "optimal" $\lambda$ for the data set at hand, found using cross-validation.
  • If I understand correctly, the estimated degrees of freedom is 1 in both cases, since each fit has exactly one nonzero regression coefficient.
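
To make the scenario concrete, here is a minimal sketch that fits both estimators to one simulated zero-mean dataset with scikit-learn. The data and parameter values are hypothetical, and sklearn's `alpha` plays the role of the penalty intensity $\lambda$; the exact values 0.5 and 0.4 depend on the sample and the penalty, but the pattern (a shrunken yet still nonzero lasso coefficient) is typical.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=(n, 1))                      # zero-mean predictor
y = 0.5 * x[:, 0] + rng.normal(scale=0.5, size=n)

ols = LinearRegression(fit_intercept=False).fit(x, y)
lasso = Lasso(alpha=0.05, fit_intercept=False).fit(x, y)

print("OLS estimate:  ", ols.coef_[0])           # roughly 0.5
print("LASSO estimate:", lasso.coef_[0])         # shrunk toward 0, still nonzero

# Zou et al.'s df estimate is the number of nonzero coefficients:
print("df estimate, OLS:  ", np.sum(ols.coef_ != 0))    # 1
print("df estimate, LASSO:", np.sum(lasso.coef_ != 0))  # also 1
```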

Question:

  • How can the degrees of freedom be the same in both cases, when $\hat\beta_{LASSO,\lambda^*}=0.4$ suggests less "freedom" in fitting than $\hat\beta_{OLS}=0.5$?

References:

  • Zou, H., Hastie, T., and Tibshirani, R. (2007). "On the 'degrees of freedom' of the lasso." The Annals of Statistics, 35(5), 2173–2192.

Best Answer

Assume we are given a set of $n$ $p$-dimensional observations, $x_i \in \mathbb{R}^p$, $i = 1, \dotsc, n$, and assume a model of the form
\begin{align} Y_i = \langle \beta, x_i\rangle + \epsilon_i, \end{align}
where $\epsilon_i \sim N(0, \sigma^2)$, $\beta \in \mathbb{R}^p$, and $\langle \cdot, \cdot \rangle$ denotes the inner product. Let $\hat{\beta} = \delta(\{Y_i\}_{i=1}^n)$ be an estimate of $\beta$ using fitting method $\delta$ (either OLS or LASSO for our purposes). The formula for degrees of freedom given in the article (equation 1.2) is
\begin{align} \text{df}(\hat{\beta}) = \sum_{i=1}^n \frac{\text{Cov}(\langle\hat{\beta}, x_i\rangle, Y_i)}{\sigma^2}. \end{align}
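
Since this definition is hard to evaluate by eye, here is a rough Monte Carlo sketch of equation 1.2 (my own illustration, not from the paper): the design $X$ is held fixed, $Y$ is resampled many times, and each covariance in the sum is estimated empirically for both fitting methods. All parameter values are hypothetical, and sklearn's `alpha` again stands in for the penalty $\lambda$.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)
n, p, sigma = 50, 3, 1.0
X = rng.normal(size=(n, p))            # fixed design
beta = np.array([0.5, -0.3, 0.0])      # true coefficients
reps = 2000                            # number of resampled datasets

y_all, ols_fit, lasso_fit = [], [], []
for _ in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    y_all.append(y)
    ols_fit.append(LinearRegression(fit_intercept=False).fit(X, y).predict(X))
    lasso_fit.append(Lasso(alpha=0.1, fit_intercept=False).fit(X, y).predict(X))
y_all, ols_fit, lasso_fit = map(np.array, (y_all, ols_fit, lasso_fit))

def df_mc(fits):
    # sum_i Cov(yhat_i, Y_i) / sigma^2, each covariance estimated
    # across the resampled datasets
    covs = [np.cov(fits[:, i], y_all[:, i])[0, 1] for i in range(n)]
    return sum(covs) / sigma**2

print("Monte Carlo df, OLS:  ", df_mc(ols_fit))    # close to p = 3
print("Monte Carlo df, LASSO:", df_mc(lasso_fit))  # strictly smaller
```

For OLS the sum equals $\mathrm{tr}(H)\,\sigma^2/\sigma^2 = p$ exactly, where $H$ is the hat matrix, so the first number should hover around 3 while the lasso's sits below it.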

By inspecting this formula we can surmise that, in accordance with your intuition, the true DOF for the LASSO will indeed be less than the true DOF of OLS; the coefficient-shrinkage effected by the LASSO should tend to decrease the covariances.

Now, to answer your question: the reason the DOF for the LASSO is the same as the DOF for OLS in your example is that you are dealing with estimates (albeit unbiased ones) of the true DOF values, obtained from a particular dataset sampled from the model. For any particular dataset, such an estimate will generally not equal the true value (especially since the estimate is required to be an integer while the true value is in general a real number).

However, when such estimates are averaged over many datasets sampled from the model, by unbiasedness and the law of large numbers such an average will converge to the true DOF. In the case of the LASSO, some of those datasets will result in an estimator wherein the coefficient is actually 0 (though such datasets might be rare if $\lambda$ is small). In the case of OLS, the estimate of the DOF is always the number of coefficients, not the number of non-zero coefficients, and so the average for the OLS case will not contain these zeros. This shows how the estimators differ, and how the average estimator for the LASSO DOF can converge to something smaller than the average estimator for the OLS DOF.
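
To see this averaging argument numerically, here is a small sketch (same hypothetical setup as above) that records the lasso's df estimate, i.e. the number of nonzero coefficients, across many simulated datasets. Its mean should fall below $p$, and by the unbiasedness result it should roughly match the Monte Carlo df computed from the covariance formula earlier.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, sigma = 50, 3, 1.0
X = rng.normal(size=(n, p))
beta = np.array([0.5, -0.3, 0.0])   # third coefficient is truly 0
reps = 2000

nonzeros = []
for _ in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    coef = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_
    nonzeros.append(np.sum(coef != 0))   # the lasso df estimate

# OLS always reports p nonzero coefficients; the lasso sometimes fewer,
# which is what pulls its average df estimate below p.
print("Average LASSO df estimate:", np.mean(nonzeros))  # below p = 3
print("OLS df estimate is always p =", p)
```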
