Solved – Is AICc ever worse than AIC?

I know that when the number of observations is small (one paper suggests fewer than 40 times the number of parameters), AICc should be used instead of AIC for model comparison.

Does this imply that when the number of observations is large, AIC should be used rather than AICc?

(I have no theoretical reason to prefer AIC over AICc, just noticed that AICc is the default for some software packages and wondered whether there are cases where the second order criterion is inappropriate.)

Best Answer

When the number of observations is large, the Akaike Information Criterion (AIC) and the small-sample corrected Akaike Information Criterion (AICc) become practically identical, because AICc converges to AIC. We therefore gain (or lose) almost nothing by switching between the two criteria. I suggest keeping AICc for consistency throughout an analysis.
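To see this concretely, here is a minimal sketch in plain Python (the helper names `aic` and `aicc` are mine) evaluating the difference between the two criteria, which depends only on $p$ and $n$, using the formulas derived below:

```python
# Minimal sketch: the AICc - AIC gap depends only on p and n and
# vanishes as n grows.

def aic(loglik: float, p: int) -> float:
    """AIC = -2 * maximized log-likelihood + 2 * number of parameters."""
    return -2.0 * loglik + 2.0 * p

def aicc(loglik: float, p: int, n: int) -> float:
    """AICc adds the small-sample correction term 2p(p+1)/(n - p - 1)."""
    return aic(loglik, p) + 2.0 * p * (p + 1) / (n - p - 1)

p = 5  # parameters in the candidate model
for n in (25, 50, 100, 1000, 10000):
    # The log-likelihood cancels in the difference, so any value works here.
    gap = aicc(0.0, p, n) - aic(0.0, p)
    print(f"n = {n:>6}: AICc - AIC = {gap:.4f}")
```

For $p = 5$ the gap is about 3.16 at $n = 25$ but below 0.01 by $n = 10000$, far too small to change a model ranking.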

Some further discussion: AIC estimates the relative expected Kullback–Leibler (KL) information $I$ between competing models. Assuming our model's density is $f_M$ and the true density is $g$, the KL information can be expressed as:

$$I(g,f_M) = \int g(x) \log\left(\frac{g(x)}{f_M(x;\theta)}\right)dx$$

Notice that this looks very much like a likelihood ratio; if $f_M$ and $g$ are the same, the ratio $\frac{g(x)}{f_M(x;\theta)}$ equals 1, so its logarithm is 0. We can immediately re-write the above as:

$$I(g,f_M) = \int g(x) \log g(x)dx - \int g(x) \log f_M(x;\theta)dx$$

and realise that the first term is constant (it does not depend on the model $M$), so we only care about:

$$ - \int g(x) \log f_M(x;\theta)dx$$
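As a quick numerical sanity check of this decomposition, here is a sketch using SciPy with densities of my own choosing ($g = N(0,1)$ and a mis-specified candidate $f_M = N(0.5, 1.5^2)$); the KL information indeed equals the constant entropy term plus the model-dependent cross-entropy term:

```python
# Numerical check: I(g, f_M) = (int g log g dx) + (- int g log f_M dx).
import numpy as np
from scipy import integrate, stats

g = stats.norm(0.0, 1.0)   # "true" density g
f = stats.norm(0.5, 1.5)   # candidate model density f_M

kl, _ = integrate.quad(lambda x: g.pdf(x) * np.log(g.pdf(x) / f.pdf(x)), -10, 10)
const, _ = integrate.quad(lambda x: g.pdf(x) * np.log(g.pdf(x)), -10, 10)
cross, _ = integrate.quad(lambda x: -g.pdf(x) * np.log(f.pdf(x)), -10, 10)

print(kl, const + cross)   # both print ~0.1832
```

Only the cross-entropy term changes when we swap candidate models, which is why ranking models by $-\int g(x) \log f_M(x;\theta)dx$ ranks them by KL information.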

Now, what Akaike did was to: 1. realise that while $g$ is unknown, we do have observations $X_1, X_2, \dots, X_n$ drawn from $g$, so by the law of large numbers:

$$ - \int g(x) \log f_M(x;\theta)dx \approx -\frac{1}{n}\sum_{i=1}^n \log(f_M(X_i;\theta))$$

(which is simply the negative log-likelihood for model $M$, scaled by $1/n$) and 2. realise that this is an over-optimistic (over-fitted) estimate of the quantity above, because we use the same data both to estimate $\theta$ and to evaluate the log-likelihood. Without going into further gory details, the bias is asymptotically equal to $\frac{p}{n}$, where $p$ is the number of parameters estimated by $M$. So what we actually care about is:

$$ -\frac{1}{n} \sum_{i=1}^n \log(f_M(X_i;\theta)) + \frac{p}{n}$$ and if we multiply this by $2n$ we get the AIC for model $M$:

\begin{align} AIC(M) &= -2\sum_{i=1}^n \log(f_M(X_i;\theta)) + 2p \\ &= -2 l + 2p \end{align}
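The $\frac{p}{n}$ bias is easy to see in a simulation. Here is a sketch with a setup of my own choosing: fit a normal model ($p = 2$ estimated parameters) and compare the in-sample mean log-likelihood with its out-of-sample counterpart on a fresh draw from $g$:

```python
# Simulation sketch of the over-fitting bias: the in-sample mean
# log-likelihood at the MLE overshoots the out-of-sample one by ~ p/n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, p, reps = 50, 2, 20000
gaps = []
for _ in range(reps):
    x = rng.standard_normal(n)         # sample from g = N(0, 1)
    mu, sigma = x.mean(), x.std()      # MLE of mean and sd (p = 2)
    insample = stats.norm.logpdf(x, mu, sigma).mean()
    fresh = rng.standard_normal(n)     # independent draw from g
    outsample = stats.norm.logpdf(fresh, mu, sigma).mean()
    gaps.append(insample - outsample)

print(np.mean(gaps), p / n)   # ~0.043 vs 0.04
```

The simulated gap at $n = 50$ comes out slightly above the asymptotic $\frac{p}{n} = 0.04$; that small-sample excess is precisely the kind of residual bias the correction below targets.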

So the AIC equals minus two times the maximized log-likelihood plus two times the number of estimated parameters. Hurvich and Tsai's Regression and time series model selection in small samples (1989) further showed that this corrected estimate is still biased when $n$ is not large relative to $p$. Their additional correction term is $\frac{2p(p+1)}{n-p-1}$, which leads to the AICc formula:

\begin{align} AICc(M) = -2 l + 2p + \frac{2p(p+1)}{n -p -1} \end{align}

That is why AICc (the second-order AIC) is advocated when the sample size is relatively small; clearly, as $\frac{n}{p}$ gets large, this extra correction term tends to 0. Burnham and Anderson, in Model Selection and Multimodel Inference (2002), suggest using AICc when the ratio between the sample size $n$ and the number of parameters $p$ in the largest candidate model is small (<40), but realistically any difference between AIC and AICc will be negligible as $n$ gets large (e.g. >100). I have also found Takezawa's Learning Regression Analysis by Simulation (2014), Chapter 5, "Akaike’s Information Criterion (AIC) and the Third Variance", a great resource on the matter.
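To put a number on that rule of thumb: with $p = 5$ parameters and $n = 200$ observations (so $n/p = 40$ exactly), the extra correction equals $\frac{2 \cdot 5 \cdot 6}{200 - 5 - 1} = \frac{60}{194} \approx 0.31$ on the AIC scale, already small next to the differences of several units that typically separate competing models.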