Solved – AIC of ridge regression: degrees of freedom vs. number of parameters

aic, degrees of freedom, regression, ridge regression

I want to calculate the AICc of a ridge regression model. The problem is the number of parameters. For linear regression, most people suggest that the number of parameters equals the number of estimated coefficients plus one for the error variance $\sigma^2$.

When it comes to ridge regression, I have read that the trace of the hat matrix, i.e. the effective degrees of freedom ($df$), is simply used as the number-of-parameters term in the AIC formula (e.g., here or here).

Is this correct? Can I also simply use the $df$ to calculate the AICc? And can I simply add 1 to the $df$ to account for the error variance?
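To make the question concrete, here is a minimal sketch of the computation I have in mind (the toy data and names are mine; the "+1" in the ridge case is exactly what I am asking about):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def aicc(rss, n, k):
    """Gaussian AICc, where k counts the error variance as one parameter."""
    aic = n * np.log(rss / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)

# OLS: k = p coefficients + 1 for sigma^2
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
rss_ols = np.sum((y - X @ beta_ols) ** 2)
print(aicc(rss_ols, n, p + 1))

# Ridge: is k = tr(H), or should it be tr(H) + 1?
lam = 1.0
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
rss_ridge = np.sum((y - H @ y) ** 2)
print(aicc(rss_ridge, n, np.trace(H) + 1))
```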

Best Answer

AIC and ridge regression can be made compatible when certain assumptions are made. However, there is no single method of choosing a shrinkage parameter for ridge regression, and thus no general method of applying AIC to it. Ridge regression is a special case of Tikhonov regularization, and there are many criteria that can be applied to selecting smoothing factors for Tikhonov regularization, e.g., see this. To use AIC in that context, there is a paper that makes rather specific assumptions as to how to perform that regularization: Information complexity-based regularization parameter selection for solution of ill conditioned inverse problems. Specifically, it assumes

"In a statistical framework, ...choosing the value of the regularization parameter α, and by using the maximum penalized likelihood (MPL) method....If we consider uncorrelated Gaussian noise with variance $\sigma ^2$ and use the penalty $p(x) =$ a complicated norm, see link above, the MPL solution is the same as the Tikhonov (1963) regularized solution."

The question then becomes: should those assumptions be made? The question of how many degrees of freedom are needed is secondary to the question of whether or not AIC and ridge regression are used in a consistent context. I would suggest reading the link for details. I am not avoiding the question; it is just that one can use lots of things as ridge targets. For example, one could use the smoothing factor that optimizes AIC itself (a sketch of that idea appears below). So, one good question deserves another: "Why bother with AIC in a ridge context?" In some ridge regression contexts, it is difficult to see how AIC could be made relevant. For example, ridge regression has been applied in order to minimize the relative error propagation of $b$, that is, $\min\left[\dfrac{\text{SD}(b)}{b}\right]$, of the gamma distribution (GD) given by

$$\text{GD}(t; a, b) = \frac{1}{t}\,\frac{e^{-b t}\,(b t)^{a}}{\Gamma(a)}\,; \qquad t \geq 0\,,$$

as per this paper. In particular, this difficulty arises because, in that paper, it is in effect the area under the $[0,\infty)$ time curve (AUC) that is optimized, and not the maximum likelihood (ML) of the goodness of fit to the measured $[t_1, t_n]$ time samples. To be clear, that is done because the AUC is an ill-posed integral, and otherwise, e.g., using ML, the gamma distribution fit would lack robustness for a censored time series (e.g., when the data stop at some maximum time, a case ML does not cover). Thus, for that particular application, maximum likelihood, and therefore AIC, is actually irrelevant. (It is said that AIC is used for prediction and BIC for goodness of fit. However, prediction and goodness of fit are both only rather indirectly related to a robust measure of AUC.)
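Returning to the idea of using the smoothing factor that optimizes AIC itself: here is a minimal sketch of what that would look like (my own construction, not from either linked paper), taking $df(\lambda) = \operatorname{tr}(H_\lambda)$ plus one for the error variance as the parameter count:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 8
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def ridge_aic(lam):
    """AIC of a ridge fit with k = tr(H) + 1 effective parameters."""
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    rss = np.sum((y - H @ y) ** 2)
    k = np.trace(H) + 1
    return n * np.log(rss / n) + 2 * k

# Use the smoothing factor that minimizes AIC as the ridge target
grid = np.logspace(-3, 3, 200)
lam_best = grid[np.argmin([ridge_aic(l) for l in grid])]
print(lam_best)
```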

As for the answer to the question, the first reference in the question text says that "The main point is to note that $df$ is a decreasing function of $\lambda$ [sic, the smoothing factor] with $df = p$ [sic, the effective number of parameters; see the trace of the hat matrix below] at $\lambda = 0$ and $df = 0$ at $\lambda = \infty$." This means that $df$ equals the number of estimated coefficients $p$ when there is no smoothing, which is also when ridge regression coincides with ordinary least squares, and that $df$ decreases to zero as the smoothing factor increases to $\infty$. Note that for infinite smoothing the fit is a flat line, irrespective of what density function is being fit. Finally, the exact $df$ is a function of $\lambda$:

"One can show that $df_{ridge}= \sum(\lambda_i / (\lambda_i + \lambda$ ), where {$\lambda_i$} are the eigenvalues of $X^{\text{T}} X$." Interestingly, that same reference defines $df$ as the trace of the hat matrix, see def.