Solved – AIC with test data, is it possible?

aic, model-selection

When performing AIC (or BIC) model selection between competing models, one fits each model to the same training data set and then compares their AIC (or BIC, etc.) values. If a model's AIC exceeds the lowest AIC among the candidates by more than 10, the selection is considered conclusive in favour of the lowest-AIC model. There is a great deal of discussion of this topic in Burnham and Anderson.
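For concreteness, here is a minimal sketch of such a comparison (in Python, assuming i.i.d. Gaussian errors; the model forms `rho1`/`rho2`, the toy data, and all names are purely illustrative, not from the question):

```python
# Minimal sketch: Delta-AIC comparison of two least-squares fits under a
# Gaussian error assumption. Models, data and names are illustrative only.
import numpy as np
from scipy.optimize import curve_fit

def aic_gaussian(y, y_hat, k):
    """AIC of a least-squares fit with i.i.d. Gaussian errors, up to additive
    constants shared by all models; k counts mean parameters + error variance."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(0)
r = np.linspace(0.1, 5.0, 50)
y = 2.0 * np.exp(-0.7 * r) + rng.normal(0.0, 0.05, r.size)  # toy data

rho1 = lambda r, a, b: a * np.exp(-b * r)        # candidate model 1
rho2 = lambda r, a, b: a / (1.0 + b * r) ** 2    # candidate model 2

theta1, _ = curve_fit(rho1, r, y, p0=[1.0, 1.0])  # fit both models on the
theta2, _ = curve_fit(rho2, r, y, p0=[1.0, 1.0])  # same training data

aic1 = aic_gaussian(y, rho1(r, *theta1), k=3)
aic2 = aic_gaussian(y, rho2(r, *theta2), k=3)
print("Delta AIC =", abs(aic1 - aic2))  # differences > 10 are read as conclusive
```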

My problem: I am fitting two parametric models of different functional form (same number of parameters, same type of response), say $y = \rho_1(r|\theta_1)$ and $y = \rho_2(r|\theta_2)$, which give fits of similar quality; their AICc difference is on the order of $\Delta \mathrm{AICc}\sim 2$, which is not conclusive. Now, I know that one of these models is the "true" one (in the sense that it is the model from which the data were generated), and it does come out with the best AICc value. Since both models have the same number of parameters, it is essentially their mean squared error loss that drives the selection. However, if I use a sample of test data (held out from training), the difference in their (average) squared error loss climbs to $\sim 10$.

My questions:

a. Is it possible to use AICc with test data (i.e. evaluate the squared error loss on test data, while the model parameters are obtained from a training data set) and perform model selection? My intuition says yes, but I have no theoretical justification for this. In addition, I have no reference cut-off value for conclusive model selection when using a test data set.

b. Is there some kind of error-based model selection on test data that can be conclusive (in the sense that AICc is conclusive when the difference $\Delta \mathrm{AICc} > 10$)?

PS. I understand that "all models are wrong", in the sense that they are only approximations of the realized data, and perhaps the very notion of conclusive model selection does not make much sense.

Best Answer

The quantity that AIC / AICc estimates is the expected out-of-sample log-likelihood (see Burnham & Anderson 2004, Sec. 2.2), $$ \mathbf E_y \mathbf E_x [ \log g(x | \hat \theta (y)) ], $$ (multiplied by $-2$). This formula means: you obtain maximum likelihood parameter estimates from one sample, $\hat \theta (y)$, then compute the logarithm of the likelihood $g$ of these parameters on another independent sample $x$ (from the same source), and then average across infinitely many realizations of both samples, $\mathbf E_y \mathbf E_x$.
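To make the double expectation concrete, here is a small Monte Carlo sketch for a deliberately simple model (i.i.d. Gaussian data with unknown mean and known unit variance; the whole setup is my own illustration, not part of the references above):

```python
# Sketch: Monte Carlo reading of E_y E_x[ log g(x | theta_hat(y)) ] for an
# i.i.d. Gaussian model with unknown mean and known variance 1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, reps, true_mu = 30, 2000, 0.5

vals = []
for _ in range(reps):
    y = rng.normal(true_mu, 1.0, n)   # estimation sample y
    x = rng.normal(true_mu, 1.0, n)   # independent evaluation sample x
    mu_hat = y.mean()                 # maximum likelihood estimate theta_hat(y)
    vals.append(norm.logpdf(x, mu_hat, 1.0).sum())  # log g(x | theta_hat(y))

# Averaging over many (y, x) pairs approximates E_y E_x[ log g(x | theta_hat(y)) ]
print(np.mean(vals))
```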

Akaike's main result is that $\log g(x | \hat \theta (x)) - K$ is an asymptotically unbiased estimator of the quantity given above, where $K$ is the number of parameters. Equivalently, $$ \mathrm{AIC} = - 2 \log g(x | \hat \theta (x)) + 2 K $$ is an asymptotically unbiased estimator of $$ - 2 ~\mathbf E_y \mathbf E_x [ \log g(x | \hat \theta (y)) ]. $$ If the sample size is small compared to the number of parameters $K$, a better correction term (AICc; see Burnham & Anderson 2004, Sec. 7.7.6, and McQuarrie & Tsai 1998) may be necessary; it is more complicated than $2K$, and it differs between models (likelihood functions $g$).
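Written as code, the generic formula and the commonly quoted small-sample correction look like this (a sketch only; the $2K(K+1)/(n-K-1)$ term is exact for Gaussian linear models, and, as noted above, the proper AICc correction can differ for other likelihoods):

```python
# Sketch: AIC and the commonly used small-sample corrected AICc.
def aic(loglik, k):
    """AIC from a maximized log-likelihood and parameter count k."""
    return -2.0 * loglik + 2.0 * k

def aicc(loglik, k, n):
    """AICc with the usual correction term (exact for Gaussian linear models)."""
    return aic(loglik, k) + 2.0 * k * (k + 1) / (n - k - 1)
```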


If you use cross-validation with $$ - 2 \log g(x_\mathrm{test} | \hat \theta (x_\mathrm{train})) $$ as the error measure, you are estimating (almost) the same thing.
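A minimal sketch of that error measure for a least-squares fit with Gaussian errors (the helper name `holdout_deviance` and the plug-in error scale are my own illustrative choices):

```python
# Sketch: -2 log g(x_test | theta_hat(x_train)) for a curve fit with
# Gaussian errors; names and the plug-in sigma are illustrative choices.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def holdout_deviance(model, r_tr, y_tr, r_te, y_te, p0):
    theta, _ = curve_fit(model, r_tr, y_tr, p0=p0)   # theta_hat(x_train)
    resid = y_tr - model(r_tr, *theta)
    sigma = np.sqrt(np.mean(resid ** 2))             # plug-in error scale from training data
    loglik_test = norm.logpdf(y_te, model(r_te, *theta), sigma).sum()
    return -2.0 * loglik_test                        # -2 log-likelihood on the test data
```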

The differences between AIC / AICc and cross-validation are:
– AIC / AICc formulas are based on an approximation for large samples (they are only asymptotically correct).
– For cross-validation, in order to have test data $x_\mathrm{test}$, the training (estimation) data $x_\mathrm{train}$ must be smaller than the original sample $x$. Moreover, the cross-validated measure depends on two random samples instead of just one, so it may be noisier.


Answer a: AIC / AICc and cross-validation are alternative methods to do the same thing (if log-likelihood is the error measure). It therefore does not make sense to use them together.

Answer b: Since AIC / AICc and cross-validation with $-2$ log-likelihood estimate (almost) the same quantity, it should be OK to apply the same scale to evaluate differences between models.
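As an illustration of that last point (continuing the hypothetical `rho1`/`rho2` data and the `holdout_deviance` helper sketched above, so this snippet is not self-contained), the cross-validated difference can be averaged over several random splits and then read on the same scale as $\Delta \mathrm{AICc}$:

```python
# Sketch: average the -2 log-likelihood difference over random train/test
# splits (reduces the extra noise mentioned above) and read it like Delta AIC.
# Assumes r, y, rho1, rho2 and holdout_deviance from the earlier sketches.
import numpy as np

rng = np.random.default_rng(2)
deltas = []
for _ in range(50):
    idx = rng.permutation(r.size)
    tr, te = idx[:35], idx[35:]                      # random train/test split
    d1 = holdout_deviance(rho1, r[tr], y[tr], r[te], y[te], p0=[1.0, 1.0])
    d2 = holdout_deviance(rho2, r[tr], y[tr], r[te], y[te], p0=[1.0, 1.0])
    deltas.append(d1 - d2)

# Differences around 10 or more would be read as conclusive, as with Delta AICc.
print("mean difference in -2 log-likelihood:", np.mean(deltas))
```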