Solved – AIC formula in Introduction to Statistical Learning

aic, machine learning, regression

I'm a little puzzled by a formula presented in Hastie's "Introduction to Statistical Learning". In Chapter 6, page 212 (sixth printing), it is stated that:

$AIC = \frac{RSS}{n\hat\sigma^2} + \frac{2d}{n} $

This is for linear models with Gaussian noise, with $d$ being the number of predictors and $\hat\sigma^2$ being the estimate of the error variance. However,

$\hat\sigma^2 = \frac{RSS}{(n-2)}$

as stated in Chapter 3, page 66. Substituting this into the AIC formula would imply:

$AIC = \frac{(n-2)}{n} + \frac{2d}{n} $

Which can't be right. Can someone point out what I'm doing incorrectly?

Best Answer

I think you are conflating two different residual sums of squares. One RSS is used to estimate $\hat{\sigma}^2$ in the formula; this RSS does not depend on the number of parameters, $p$, of the model being evaluated. This $\hat{\sigma}^2$ should be estimated using all of your covariates, which gives you a baseline unit of error. The RSS that appears in the AIC formula is better written as $\text{RSS}_{p_i}$, meaning that it corresponds to model $i$ with $p$ parameters (there may be many models with $p$ parameters). So the RSS in the formula is calculated for a specific model, while the RSS behind $\hat{\sigma}^2$ comes from the full model.

This is also noted on the page before, where $\hat{\sigma}^2$ is introduced for $C_p$.

So the RSS in the AIC formula is not independent of $p$; it is calculated for a given model. The only purpose of $\hat{\sigma}^2$ here is to provide a baseline unit for the error, so that there is a "fair" comparison between the number of parameters and the reduction in error. You need to compare the number of parameters to something that is scaled with respect to the magnitude of the error.
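To make the distinction explicit, here is one way to write it out. The labels $\text{RSS}_{\text{full}}$, $\text{RSS}_d$ and $k$ are my own, not the book's: I assume the full model has $k$ predictors plus an intercept (so $n-k-1$ residual degrees of freedom), and $d$ is the number of predictors in the candidate model, as in the question.

$\hat\sigma^2 = \frac{\text{RSS}_{\text{full}}}{n-k-1}$

$AIC_d = \frac{\text{RSS}_d}{n\hat\sigma^2} + \frac{2d}{n} = \frac{n-k-1}{n}\cdot\frac{\text{RSS}_d}{\text{RSS}_{\text{full}}} + \frac{2d}{n}$

The ratio $\text{RSS}_d / \text{RSS}_{\text{full}}$ equals 1 only when the candidate model is the full model, so the cancellation in the question happens only in that single case. The $n-2$ from Chapter 3 appears to be the simple linear regression case, $k = 1$.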

If you did not scale the RSS by the baseline error, the raw drop in RSS could be numerically much larger than the increase in the number of variables, and you would become too greedy in adding variables. Once the RSS is scaled to a common unit, the comparison with the number of parameters no longer depends on the magnitude of the baseline error.

This is not the general way to calculate AIC, but the general definition essentially boils down to something like this in cases where a simpler version of the formula can be derived.
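The calculation described in this answer can be sketched numerically. This is only an illustration under my own assumptions: the data are simulated, the candidate models are the nested ones using the first `d` columns, and `rss`, `k`, `sigma2_hat` and `aic` are hypothetical names, not anything from the book.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

# Simulated data (assumption): only the first two of k predictors matter.
rng = np.random.default_rng(0)
n, k = 200, 5
X = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Baseline error variance, estimated ONCE from the full model.
rss_full = rss(X, y)
sigma2_hat = rss_full / (n - k - 1)

# AIC for each candidate model; the RSS here is the candidate's own RSS.
aic = {}
for d in range(k + 1):
    rss_d = rss(X[:, :d], y)
    aic[d] = rss_d / (n * sigma2_hat) + 2 * d / n
    print(f"d={d}  RSS_d={rss_d:10.2f}  AIC={aic[d]:.4f}")
```

Only for `d == k` does the candidate RSS equal the RSS used for `sigma2_hat`, so only there does the first term collapse to `(n - k - 1) / n`. For every other `d` the AIC depends on how well that particular model fits.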
