I have a dataset $S$ with $D$ features and three fitted linear regression models:
Model1. Ridge regression that is fitted on all $D$ features from $S$.
Model2. Ridge regression that is fitted on some $d < D$ features from $S$.
Model3. Lasso regression that is fitted on all $D$ features from $S$, and only $m < D$ features got non-zero weights (coefficients) after fitting.
I want to use AIC to select the best model. We know that the AIC formula for linear regression models is:
$$\mathrm{AIC} = 2k + n \log{(\mathrm{RSS}/n)}.$$
where $k$ is the number of estimated parameters (degrees of freedom) and $n$ is the sample size. So we can easily compute the AIC value for all three models.
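For concreteness, the formula above is a one-liner (a minimal sketch; the function name `aic` and the toy numbers are mine, not from the question):

```python
import numpy as np

def aic(rss, n, k):
    """AIC for a Gaussian linear model: 2k + n * log(RSS / n)."""
    return 2 * k + n * np.log(rss / n)

# Toy example with made-up values:
n = 100        # sample size
rss = 42.0     # residual sum of squares of a fitted model
k = 5 + 2      # 5 slopes + intercept + error-variance estimate
print(aic(rss, n, k))
```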
And I have two questions:
1. Can I compare AIC's values of these models and choose the best one with the lowest AIC?
I thought the answer was yes, but I became confused after reading the documentation for the `AIC` function in R. It states that the models should be fitted on the same data, whereas my Model2 is fitted on a (technically) different dataset (a subset of $S$).
2. What is $k$ value for Model3?
It is clear that $k = D + 2$ for Model1 ($D$ slope estimates + the intercept estimate + the $\hat \sigma^2_\varepsilon$ estimate) and, similarly, $k = d + 2$ for Model2.
But Model3 has only $m$ non-zero slope parameters after fitting. Does that mean that $k = m + 2$ for Model3?
Best Answer
Some preliminaries: In LASSO models, the number of non-zero coefficients is an unbiased and consistent estimate of the degrees of freedom of the lasso (see Zou et al. (2007), "On the 'degrees of freedom' of the lasso", for more details). In ridge models, the degrees of freedom are directly related to the singular values of the centred input matrix $X$ (see Hastie et al. (2009), "The Elements of Statistical Learning", Sect. 3.4.1, for more details). Assuming the matrix $X$ has the SVD $X = USV^T$, the degrees of freedom as a function of $\lambda$ are $\mathrm{df}(\lambda) = \sum_{j=1}^p \frac{s_j^2}{s_j^2 + \lambda}.$ We can see clearly that for $\lambda \rightarrow 0$ we get $p$ degrees of freedom and for $\lambda \rightarrow \infty$ we get $0$ degrees of freedom.
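The ridge degrees-of-freedom formula is easy to verify numerically (a minimal sketch; the function name `ridge_df` and the simulated matrix are mine):

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom of ridge regression:
    df(lambda) = sum_j s_j^2 / (s_j^2 + lambda),
    where s_j are the singular values of the centred design matrix."""
    Xc = X - X.mean(axis=0)                       # centre the inputs
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values only
    return np.sum(s**2 / (s**2 + lam))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
print(ridge_df(X, 0.0))   # lambda -> 0 recovers p = 4 degrees of freedom
print(ridge_df(X, 1e6))   # lambda -> infinity shrinks df towards 0
```

This also makes the monotone shrinkage visible: `ridge_df` decreases continuously as `lam` grows, unlike the lasso's integer count of non-zero coefficients.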
Based on these preliminaries, and turning to your questions in particular:
For question 1: AIC is built on the log-likelihood of the model, which for a Gaussian model is $$ \log(L(\theta)) =-\frac{|D|}{2}\log(2\pi) -\frac{1}{2} \log(|K|) -\frac{1}{2}(x-\mu)^T K^{-1} (x-\mu), $$ with $K$ being the covariance structure of our model (and $|K|$ its determinant), $|D|$ the number of points in our dataset, $\mu$ the mean response and $x$ our dependent variable. Nothing invalidates its use on different datasets, assuming we have the same dependent variable. (Notice, though, that for BIC we do want nested models, so that the models' covariance structures $K$ are hierarchical.) Being quite permissive with ourselves, in Model2 we have an "elastic-net-like" situation where certain explanatory variables' coefficients are manually set to $0$.
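As a quick sanity check, the log-likelihood above can be evaluated numerically (a minimal sketch; the helper name `gaussian_loglik` is mine, and I use `slogdet`/`solve` rather than forming $K^{-1}$ explicitly):

```python
import numpy as np

def gaussian_loglik(x, mu, K):
    """Multivariate-normal log-likelihood:
    -n/2 log(2*pi) - 1/2 log|K| - 1/2 (x - mu)^T K^{-1} (x - mu)."""
    n = len(x)
    r = x - mu
    _, logdet = np.linalg.slogdet(K)              # stable log-determinant
    quad = r @ np.linalg.solve(K, r)              # (x-mu)^T K^{-1} (x-mu)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad

# With K = I the formula collapses to -n/2 log(2*pi) - ||x - mu||^2 / 2:
x = np.array([1.0, 2.0])
print(gaussian_loglik(x, np.zeros(2), np.eye(2)))
```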