Solved – AIC and its degrees of freedom for linear regression models

Tags: aic, degrees-of-freedom, lasso, model-selection, ridge-regression

I have a dataset $S$ with $D$ features and three fitted linear regression models:
Model 1: ridge regression fitted on all $D$ features from $S$.
Model 2: ridge regression fitted on some $d < D$ features from $S$.
Model 3: lasso regression fitted on all $D$ features from $S$, where only $m < D$ features received non-zero weights (coefficients) after fitting.

I want to use AIC to select the best model. The AIC formula for linear regression models is:
$$\mathrm{AIC} = 2k + n \log{(\mathrm{RSS}/n)},$$
where $k$ is the number of estimated parameters (degrees of freedom) and $n$ is the sample size. So we can easily calculate the AIC value for all three models.
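As a sanity check, the RSS-based formula can be computed directly. Below is a minimal sketch on simulated data; the helper name `aic_from_rss` and all data are illustrative, not from any particular library:

```python
import numpy as np

def aic_from_rss(rss, n, k):
    """AIC = 2k + n*log(RSS/n) for a Gaussian linear model.
    k counts all estimated parameters, including the intercept
    and the error-variance estimate."""
    return 2 * k + n * np.log(rss / n)

# Illustrative simulated data and an OLS fit.
rng = np.random.default_rng(0)
n, D = 100, 5
X = rng.normal(size=(n, D))
beta = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y = X @ beta + rng.normal(scale=0.5, size=n)

X1 = np.column_stack([np.ones(n), X])          # add intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
rss = np.sum((y - X1 @ coef) ** 2)
k = D + 2                                      # D slopes + intercept + sigma^2
print(aic_from_rss(rss, n, k))
```

With this helper, each of the three models only needs its own RSS and its own $k$.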

And I have two questions:
1. Can I compare the AIC values of these models and choose the one with the lowest AIC?
I thought the answer was yes, but I became confused after reading the documentation for the `AIC` function in R. It states that the models should be fitted to the same data, whereas my Model 2 is fitted on a (technically) different dataset (a subset of $S$).

2. What is the value of $k$ for Model 3?
It is clear that $k = D + 2$ for Model 1 ($D$ slope estimates + intercept estimate + $\hat \sigma^2_\varepsilon$ estimate) and, similarly, $k = d + 2$ for Model 2.
But Model 3 has only $m$ non-zero slope parameters after fitting. Does that mean $k = m + 2$ for Model 3?

Best Answer

Some preliminaries: in LASSO models, the number of non-zero coefficients is an unbiased and consistent estimate of the degrees of freedom of the lasso (see Zou et al. (2007), "On the 'degrees of freedom' of the lasso", for details). In ridge models, the degrees of freedom are directly related to the singular values of the centred input matrix $X$ (see Hastie et al. (2009), "The Elements of Statistical Learning", Sect. 3.4.1, for details). Assuming $X$ has the SVD $X = USV^T$, the degrees of freedom as a function of $\lambda$ are
$$\mathrm{df}(\lambda) = \sum_{j=1}^p \frac{s_j^2}{s_j^2 + \lambda}.$$
We can see clearly that for $\lambda \rightarrow 0$ we get $p$ degrees of freedom and for $\lambda \rightarrow \infty$ we get $0$ degrees of freedom.
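The $\mathrm{df}(\lambda)$ formula above is easy to compute numerically. A minimal sketch with NumPy, using an illustrative random design matrix:

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom of ridge regression:
    df(lambda) = sum_j s_j^2 / (s_j^2 + lambda),
    where s_j are the singular values of the centred X."""
    Xc = X - X.mean(axis=0)                    # centre the columns
    s = np.linalg.svd(Xc, compute_uv=False)
    return np.sum(s**2 / (s**2 + lam))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                   # illustrative data, p = 4
print(ridge_df(X, 0.0))    # -> 4.0, i.e. p, as lambda -> 0
print(ridge_df(X, 1e9))    # -> near 0 as lambda -> infinity
```

Note that `ridge_df` decreases monotonically in $\lambda$, interpolating between the two limits mentioned above.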

Based on these and for your questions in particular:

  1. Yes, your observation is correct: M1 and M2 are not fitted on the same data. That said, the documentation says this mainly to stop users from comparing models fitted to different variants of the response variable, or to datasets with different rows or sample sizes. This becomes more obvious if we move from the "RSS"-derived calculation of AIC to the log-likelihood-derived one. Assume a Gaussian log-likelihood:

$$ \log(L(\theta)) =-\frac{|D|}{2}\log(2\pi) -\frac{1}{2} \log(|K|) -\frac{1}{2}(x-\mu)^T K^{-1} (x-\mu), $$ where $K$ is the covariance structure of our model ($|K|$ its determinant), $|D|$ the number of points in our dataset, $\mu$ the mean response and $x$ our dependent variable. Nothing invalidates using this on different feature subsets, as long as we model the same dependent variable. (Notice, though, that for BIC we do want nested models, so that the model covariance structures $K$ are hierarchical.) Being quite permissive with ourselves, in M2 we have an "elastic-net-like" situation where certain explanatory variables' coefficients are manually set to $0$.
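In the iid-error special case $K = \sigma^2 I$, the log-likelihood-derived AIC ($2k - 2\log L$) reduces to the RSS-derived form up to the additive constant $n(\log 2\pi + 1)$, which cancels in model comparisons. A small sketch showing this numerically (data and helper name are illustrative):

```python
import numpy as np

def gaussian_loglik_iid(y, mu, sigma2):
    """Gaussian log-likelihood with K = sigma^2 * I (iid errors)."""
    n = len(y)
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - 0.5 * np.sum((y - mu) ** 2) / sigma2)

rng = np.random.default_rng(2)
n = 80
y = rng.normal(size=n)
mu = np.zeros(n)                     # illustrative "fitted" mean
rss = np.sum((y - mu) ** 2)
sigma2_hat = rss / n                 # MLE of the error variance

k = 2  # illustrative: one mean parameter + sigma^2
aic_loglik = 2 * k - 2 * gaussian_loglik_iid(y, mu, sigma2_hat)
aic_rss = 2 * k + n * np.log(rss / n)
print(aic_loglik - aic_rss)          # constant n*(log(2*pi) + 1)
```

Because the difference is a constant depending only on $n$, either form ranks models identically when the sample size is the same.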

  2. I hope it is clear that the original calculation of $k$ for M1 and M2 is a bit of an oversimplification: it needs to account directly for $\lambda$, via the effective degrees of freedom $\mathrm{df}(\lambda)$. That said, you are correct that for M3 we use the number $m$ of non-zero coefficients.
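To illustrate counting $m$ for the lasso, here is a hedged sketch using the special case of an orthonormal design, where the lasso solution is simply a coordinate-wise soft-threshold of the OLS coefficients (all data and names are illustrative):

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso solution for an orthonormal design (X^T X = I):
    soft-threshold the OLS coefficients coordinate-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(3)
n, D = 200, 6
# Orthonormal design via QR, so that X^T X = I.
X, _ = np.linalg.qr(rng.normal(size=(n, D)))
beta = np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.0])
y = X @ beta + 0.1 * rng.normal(size=n)

ols = X.T @ y                        # OLS coefficients (orthonormal X)
lasso = soft_threshold(ols, lam=0.5)
m = np.count_nonzero(lasso)          # estimated df of the lasso fit
k = m + 2                            # m slopes + intercept + sigma^2
print(m, k)
```

For a general (non-orthonormal) design the solution has no closed form, but the recipe is the same: fit the lasso, count the non-zero coefficients, and use that count as the degrees of freedom in the AIC.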