For AIC/BIC selection it doesn't really matter whether you choose to count the variance parameter, as long as you are consistent across models, because inference based on information-theoretic criteria only depends on the differences between the values for different models. Thus, adding 2 (for an additional parameter) across the board doesn't change any of the delta-*IC values, which is all you are using to choose models (or do model averaging, compute model weights, etc. etc.).
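As a quick sanity check on that point (toy numbers of my own, not from any real fit): if one candidate model has AIC $=100$ and a competitor has AIC $=104$, then
$$\Delta\mathrm{AIC} = 104 - 100 = (104 + 2) - (100 + 2) = 4,$$
so counting one extra parameter in every model leaves the ranking, the $\Delta$-values, and the Akaike weights untouched.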
(However, you do have to be careful if you're going to compare models fitted with different procedures or different software packages, because they may count parameters in different ways.)
It does matter if you are going to use AICc or some other finite-sample-corrected criterion, because then the absolute number of parameters enters the correction (the denominator of the correction term is $n-k-1$). Then the question you have to ask is whether a nuisance parameter such as the residual variance, which can be computed from the residuals without modifying the estimation procedure, should be included in the count. I wrote in this r-sig-mixed-models post that I'm not sure about the right procedure here. However, looking quickly at Hurvich and Tsai's original paper (Hurvich, Clifford M., and Chih-Ling Tsai. 1989. "Regression and Time Series Model Selection in Small Samples." Biometrika 76 (2): 297–307. doi:10.1093/biomet/76.2.297, http://biomet.oxfordjournals.org/content/76/2/297.abstract), it does appear that they include the variance parameter, i.e. they use $k = m+1$ for a linear model with $m$ linear coefficients ...
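For reference, the finite-sample correction is where the absolute count $k$ stops cancelling, since it enters nonlinearly:
$$\mathrm{AIC}_c = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1},$$
so using $k = m+1$ instead of $k = m$ shifts the correction by a different amount for models of different sizes, and the choice no longer washes out of the $\Delta$-values.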
I would further quote Press et al. (Numerical Recipes in C) that:
> We might also comment that if the difference between $N$ and $N-1$ ever matters to you, then you are probably up to no good anyway - e.g., trying to substantiate a questionable hypothesis with marginal data.
(They are discussing the bias correction term in the sample variance calculation, but the principle applies here as well.)
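(For concreteness, the two estimators they are contrasting are
$$\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\bar x)^2 \qquad\text{versus}\qquad s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar x)^2,$$
which differ only by the factor $N/(N-1)$, negligible unless $N$ is small.)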
For anyone interested, this paper helped me a lot: "Estimating Predictive Variances with Kernel Ridge Regression" by G. C. Cawley, N. L. C. Talbot and O. Chapelle.
Note that this paper explains how to do it for heteroscedastic noise; however, you can also use it for homoscedastic noise as follows.
The first method, KRR + LOO:
- Train the KRR model on the data
- For each sample, compute the LOO error
- To estimate $\sigma^2$, compute the mean of the squared LOO residuals found in step 2 (a minimal Python sketch follows this list)
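To make the recipe concrete, here is a minimal Python sketch of the KRR + LOO idea using scikit-learn. The kernel, hyperparameters, and toy data are my own placeholder assumptions (the paper works with a closed-form LOO expression for KRR; this brute-force refitting version is just the simplest thing that illustrates the steps):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import LeaveOneOut

# Toy data (placeholder, purely illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=50)

# Steps 1-2: refit the KRR model with each sample left out and record the LOO residual.
loo_residuals = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = KernelRidge(kernel="rbf", alpha=1e-2, gamma=0.5)  # hyperparameters are guesses
    model.fit(X[train_idx], y[train_idx])
    loo_residuals.append(y[test_idx][0] - model.predict(X[test_idx])[0])

# Step 3: homoscedastic noise estimate from the squared LOO residuals.
sigma2_hat = float(np.mean(np.square(loo_residuals)))
print(f"estimated noise variance: {sigma2_hat:.4f}")
```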
A more detailed method is KRR + LOO + KRR.
- Train the KRR model on the data
- For each sample, compute the LOO error
- Train a KRR model on the squared residuals that were found in step 2
- The predictions of the second model are the prediction variances; we now have to regularize two models, but in return we get heteroscedastic variances. I guess I could also take the mean prediction of the second KRR model to find the homoscedastic noise term $\sigma^2$, but I'm unsure whether this is useful or better than the method above (a code sketch of this two-stage procedure follows the list)
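Continuing the same sketch (reusing `X`, `y`, and `loo_residuals` from the block above); the clipping of negative predictions is my own guard, not something taken from the paper:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Step 3: fit a second KRR model to the squared LOO residuals, so its output
# estimates the local noise variance (heteroscedastic case).
sq_residuals = np.square(loo_residuals)
var_model = KernelRidge(kernel="rbf", alpha=1e-1, gamma=0.5)  # separate regularization to tune
var_model.fit(X, sq_residuals)

# Predicted variances at new inputs; clip because a plain KRR fit to
# squared residuals can dip below zero.
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
pred_var = np.clip(var_model.predict(X_new), 1e-12, None)

# Alternative homoscedastic summary: average the second model's in-sample predictions.
sigma2_flat = float(np.mean(var_model.predict(X)))
```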
Finally, they propose a new model that directly takes heteroscedastic noise into account. This way the model can be regularized more strongly in noisy regions of feature space, and thus it is expected to perform better.
In the end, what I am really interested in is estimating the probability distribution of the prediction at unseen / new data points. The methods that I am now considering are:
Performing the KRR + LOO method to estimate $\sigma^2$. Now I can estimate the variance of new data using:
$$ \hat{y_{new}} = K_{x_{new},X_{trainset}} (K_{X_{trainset},X_{trainset}} + \lambda I)^{-1} y_{train} = P y_{train}$$
Using the identity (for a constant matrix $X$ and random vector $a$):
$$ Cov(Xa) = X \, Cov(a) \, X^T $$
and, since $Cov(y_{train}) = \sigma^2 I$, I obtain:
$$ Cov(\hat{y}_{new}) = \sigma^2 \, P P^T $$
I use the prediction on unseen data of the regular KRR model as the mean for the distribution.
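A sketch of that propagation, continuing with the toy objects above (`X`, `y`, `X_new`, `sigma2_hat`); the RBF kernel and $\lambda$ are again assumptions, and note that this only propagates the noise in the training targets, not any additional noise on the new observation itself:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

lam = 1e-2
K_train = rbf_kernel(X, X, gamma=0.5)          # K_{X_train, X_train}
K_new = rbf_kernel(X_new, X, gamma=0.5)        # K_{X_new, X_train}

# Prediction operator P = K_new (K_train + lambda I)^{-1}.
P = K_new @ np.linalg.solve(K_train + lam * np.eye(len(X)), np.eye(len(X)))

mean_pred = P @ y                               # the regular KRR prediction (the mean)
cov_pred = sigma2_hat * P @ P.T                 # Cov(y_hat_new) = sigma^2 P P^T
var_pred = np.diag(cov_pred)                    # per-point prediction variances
```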
Performing KRR + LOO + KRR. I have two choices: either I estimate $\sigma^2$ by averaging the outputs of the second KRR model and use the formula above, or I directly use the predictions of the second KRR model on new data as the prediction variances of the new points.
I use the prediction on unseen data of the regular KRR model as the mean for the distribution.
- I use their new model to estimate the mean and variance at unseen data.
- I use bootstrapping to train multiple models, and I compute the variance of their predictions on the training data. This way I can estimate $\sigma^2$. Then I use the same formula as in the first method (KRR + LOO). I use the prediction on unseen data of the regular KRR model as the mean for the distribution.
- I use bootstrapping to train multiple models, and I compute their predictions on my new data. I then take the variance of those predictions as the prediction variance at each new point (a code sketch of this bootstrap approach follows below).
I use the prediction on unseen data of the regular KRR model as the mean for the distribution.
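A hedged sketch of the bootstrap variant (ensemble size and hyperparameters are arbitrary choices of mine, and `X`, `y`, `X_new` are reused from the earlier blocks): refit KRR on resampled training sets and use the spread of the ensemble's predictions at the new points as the prediction variance.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

n_boot = 200
rng = np.random.default_rng(1)
boot_preds = np.empty((n_boot, len(X_new)))

for b in range(n_boot):
    idx = rng.integers(0, len(X), size=len(X))   # resample the training set with replacement
    model = KernelRidge(kernel="rbf", alpha=1e-2, gamma=0.5)
    model.fit(X[idx], y[idx])
    boot_preds[b] = model.predict(X_new)

var_boot = boot_preds.var(axis=0)                # ensemble spread = prediction variance per new point
```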
What I am a bit confused about is the actual distribution of the prediction at a new point. Is this distribution actually Gaussian / normally distributed?
Best Answer
AIC and ridge regression can be made compatible when certain assumptions are made. However, there is no single method of choosing the shrinkage for ridge regression, and thus no general method of applying AIC to it. Ridge regression is a special case of Tikhonov regularization. There are many criteria that can be applied to selecting smoothing factors for Tikhonov regularization, e.g., see this. To use AIC in that context, there is a paper that makes rather specific assumptions as to how to perform that regularization, Information complexity-based regularization parameter selection for solution of ill conditioned inverse problems. Specifically, it assumes
"In a statistical framework, ...choosing the value of the regularization parameter α, and by using the maximum penalized likelihood (MPL) method....If we consider uncorrelated Gaussian noise with variance $\sigma ^2$ and use the penalty $p(x) =$ a complicated norm, see link above, the MPL solution is the same as the Tikhonov (1963) regularized solution."
The question then becomes: should those assumptions be made? The question of how many degrees of freedom are needed is secondary to the question of whether or not AIC and ridge regression are used in a consistent context. I would suggest reading the link for details. I am not avoiding the question; it is just that one can use lots of things as ridge targets, for example, one could use the smoothing factor that optimizes AIC itself. So, one good question deserves another: "Why bother with AIC in a ridge context?" In some ridge regression contexts, it is difficult to see how AIC could be made relevant. For example, ridge regression has been applied in order to minimize the relative error propagation of $b$, that is, $\min\left[\dfrac{\text{SD}(b)}{b}\right]$, of the gamma distribution (GD) given by
$$ \text{GD}(t; a,b) = \dfrac{1}{t}\,\dfrac{e^{-b t}(b t)^{a}}{\Gamma (a)}\,, \qquad t \geq 0, $$
as per this paper. In particular, this difficulty arises because in that paper, it is, in effect, the Area Under the $[0,\infty)$ time Curve (AUC) that is optimized, and not the maximum likelihood (ML) of goodness of fit between measured $[t_1,t_n]$ time-samples. To be clear, that is done because the AUC is an ill-posed integral, and, otherwise, e.g., using ML, the gamma distribution fit would lack robustness for a time series that is censored (e.g., the data stops at some maximum time, and ML does not cover that case). Thus, for that particular application, maximum-likelihood, thus AIC, is actually irrelevant. (It is said that AIC is used for prediction and BIC for goodness-of-fit. However, prediction and goodness-of-fit are both only rather indirectly related to a robust measure of AUC.)
As for the answer to the question, the first reference in the question text says that "The main point is to note that $df$ is a decreasing function of $\lambda$ [sic, the smoothing factor] with $df = p$ [sic, the effective number of parameters; see the trace of the hat matrix below] at $\lambda = 0$ and $df = 0$ at $\lambda=\infty$." This means that when there is no smoothing, which is also when the ridge regression coincides with ordinary least squares, $df$ equals the number of parameters $p$, and that $df$ decreases to zero as the smoothing factor increases to $\infty$. Note that for infinite smoothing the fit is a flat line irrespective of what density function is being fit. Finally, the exact $df$ is a function of $\lambda$:
"One can show that $df_{ridge}= \sum(\lambda_i / (\lambda_i + \lambda$ ), where {$\lambda_i$} are the eigenvalues of $X^{\text{T}} X$." Interestingly, that same reference defines $df$ as the trace of the hat matrix, see def.