Solved – AIC and BIC criteria for model selection: how are they used in this paper?

aic, bic, likelihood-ratio, model selection, regression

I'm reading Model selection and inference: Facts and fiction by Leeb & Pötscher (2005) (link). In this paper they look at the following example in linear regression:

Let $$Y_i = \alpha x_{i1}+\beta x_{i2}+\epsilon_i \qquad \epsilon_i \stackrel{d}{=}N(0,\sigma^2)$$
They denote the full, unrestricted model as $U$ (where $\beta \neq 0$) and the restricted one as $R$ (where $\beta = 0$). The least squares estimator $\hat \beta(U)$ can be calculated in the unrestricted model (in the restricted model its estimator is simply zero, $\hat \beta(R)=0$). To decide whether to choose the unrestricted model, the following test statistic is used:
$$\left| \dfrac{\sqrt{n}\hat\beta(U) }{\sigma_\beta} \right| > c \qquad \text{for a certain cutoff point } c>0$$
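For concreteness, here is a minimal sketch of this pretest on simulated data (my own illustration, not from the paper), treating $\sigma$ as known:

## Pretest: compare |sqrt(n) * beta.hat(U) / sigma_beta| to the cutoff c.
set.seed(1)
n     <- 100
sigma <- 1
x1    <- rnorm(n)
x2    <- rnorm(n)
y     <- 0.5 * x1 + 0.2 * x2 + sigma * rnorm(n)

X        <- cbind(x1, x2)
beta.hat <- solve(crossprod(X), crossprod(X, y))      # OLS in the unrestricted model U
se.beta  <- sigma * sqrt(solve(crossprod(X))[2, 2])   # exact standard error (sigma known)

t.stat <- beta.hat[2] / se.beta   # equals sqrt(n) * beta.hat(U) / sigma_beta
cutoff <- sqrt(2)                 # c = sqrt(2) for AIC; c = sqrt(log(n)) for BIC
abs(t.stat) > cutoff              # TRUE -> choose the unrestricted model U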

Then they state:

This is a traditional pretest procedure based on the likelihood ratio, but it is worth noting that in the simple example discussed here it coincides exactly with Akaike's minimum AIC rule in the case $c=\sqrt{2}$ and Schwarz's minimum BIC rule if $c=\sqrt{\ln n}$

I don't see why this is the case. I have learned the following as the definitions of the AIC and BIC statistics:
$$\text{AIC}_p = n\ln \text{SSE}_p - n\ln n + 2p \qquad \text{BIC}_p=n\ln \text{SSE}_p - n\ln n + p\cdot \ln n$$
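For concreteness, this is how I would compute the two criteria for the models $R$ and $U$ on simulated data (the coefficient values below are just for illustration):

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.5 * x1 + 0.2 * x2 + rnorm(n)

mR <- lm(y ~ 0 + x1)         # restricted model R, p = 1
mU <- lm(y ~ 0 + x1 + x2)    # unrestricted model U, p = 2

sse <- function(fit) sum(residuals(fit)^2)
aic <- function(fit, p) n * log(sse(fit)) - n * log(n) + 2 * p
bic <- function(fit, p) n * log(sse(fit)) - n * log(n) + p * log(n)

aic(mU, 2) < aic(mR, 1)      # choose U under the minimum AIC rule?
bic(mU, 2) < bic(mR, 1)      # choose U under the minimum BIC rule?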

Can anyone point to the connection between the statement and the definition?

Edit

I've learned OLS through Applied Linear Statistical Models by Kutner et al., where SSE is defined as the error sum of squares, $\text{SSE}_p = \sum_i (Y_i-\hat Y_i)^2$, in the model with $p$ parameters. Here, $p=1$ corresponds to the restricted model $R$ and $p=2$ to the unrestricted model $U$.

I've looked at your answers but I don't follow yet. I'll try to explain the problem further.

If I look at AIC, then model $U$ would be chosen if $\text{AIC}_2 < \text{AIC}_1$. Writing this out results in
$$n\ln \text{SSE}_2 - n\ln n + 2\cdot 2 < n\ln\text{SSE}_1 - n\ln n + 2$$
or equivalently
$$n\ln \dfrac{\text{SSE}_1}{\text{SSE}_2} > 2$$

I don't see why the left-hand side should equal $\dfrac{n\hat \beta(U)^2}{\sigma^2_\beta}$.

Best Answer

In my answer here I show that in a case like the present one, in which we test nested models against each other, the minimum AIC rule selects the larger model (i.e., rejects the null) if the likelihood ratio statistic $$ \mathcal{LR}=n[\log(\widehat{\sigma}^2_1)-\log(\widehat{\sigma}^2_2)], $$ with $\widehat{\sigma}^2_i$ the ML error variance estimates of the restricted and unrestricted models, exceeds $2K_2$. Here, $K_2$ is the number of additional variables in the larger model. In your case, $K_2=1$, corresponding to $x_{i2}$. Thus, select the larger model if $\mathcal{LR}>2$.
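A quick simulation check of this equivalence (my own sketch, not from the paper): in the question's notation, $\text{AIC}_2-\text{AIC}_1 = 2-\mathcal{LR}$, so the minimum AIC choice and the rule $\mathcal{LR}>2$ coincide by construction.

set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.5 * x1 + 0.1 * x2 + rnorm(n)

mR <- lm(y ~ 0 + x1)                           # restricted model
mU <- lm(y ~ 0 + x1 + x2)                      # unrestricted model
sig2 <- function(fit) mean(residuals(fit)^2)   # ML variance estimate SSE/n
LR   <- n * (log(sig2(mR)) - log(sig2(mU)))

aic <- function(fit, p) n * log(sum(residuals(fit)^2)) - n * log(n) + 2 * p
c(LR > 2, aic(mU, 2) < aic(mR, 1))             # the two decisions agree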

Now, in the present linear regression framework, the absolute value of the $t$-statistic $$|t|=\left| \dfrac{\sqrt{n}\hat\beta(U) }{\sigma_\beta} \right|$$ is simply the positive square root of the LR-statistic.

(Actually, this in general only holds asymptotically, as we have $t^2=F$, the $F$- or Wald-statistic, which is in general not numerically identical to $\mathcal{LR}$ in finite samples. Leeb and Pötscher however assume that $\sigma^2$ is known, which, as is shown here, restores exact numerical equivalence of Wald, LR and score statistics in this setup.)
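To spell out the known-$\sigma^2$ case (a sketch in the notation above, writing $X$ for the $n\times 2$ design matrix with columns $x_1$ and $x_2$, and taking $\sigma_\beta^2$ to be $n\,\mathrm{Var}(\hat\beta(U))$, so that $\sqrt{n}\hat\beta(U)/\sigma_\beta$ is standard normal under $H_0$): with $\sigma^2$ known, the Gaussian log-likelihood is $-\tfrac{n}{2}\log(2\pi\sigma^2)-\text{SSE}/(2\sigma^2)$, so
$$\mathcal{LR}=2\,[\ell_U-\ell_R]=\frac{\text{SSE}_R-\text{SSE}_U}{\sigma^2}=\frac{\hat\beta(U)^2}{\sigma^2\,[(X'X)^{-1}]_{22}}=\frac{n\,\hat\beta(U)^2}{\sigma^2_\beta}=t^2,$$
where the third equality is the standard expression for the reduction in SSE from adding a single regressor. With $\sigma^2$ estimated instead, one obtains the statistic $n\ln(\text{SSE}_1/\text{SSE}_2)$ from the question's edit, which coincides with $(\text{SSE}_1-\text{SSE}_2)/\sigma^2$ only asymptotically.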

Hence, going with the larger model according to the minimum AIC rule when $\mathcal{LR}>2$ corresponds to rejecting when the $t$-statistic exceeds $\sqrt{2}$, i.e., to the pretest with cutoff $c=\sqrt{2}$. Replacing the AIC penalty $2$ with the BIC penalty $\ln n$ in the same argument gives $c=\sqrt{\ln n}$.

It is worth pointing out that this implies that, in this case, the AIC rule is nothing but a hypothesis test at level $\alpha\approx 0.157$, since the LR statistic is $\chi^2_1$ under the present $H_0$ that the smaller model is correct:

> 1-pchisq(2,df = 1)
[1] 0.1572992

or

> 2*pnorm(-sqrt(2))
[1] 0.1572992

Solving the equation $1.96=\sqrt{\ln n}$ for $n$ shows that the BIC rule has the same size as a test at the 5% level at $n\approx 46$.
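A one-line numerical check of that sample size (my own back-of-the-envelope computation, not from the paper):

exp(qnorm(0.975)^2)   # about 46.6, so sqrt(log(n)) crosses 1.96 between n = 46 and n = 47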

It does not seem to be a general result that AIC corresponds to a liberal nested hypothesis test. For example, when $K_2=8$, AIC is equivalent to rejecting when $\mathcal{LR}>16$, which, under the null, has probability

> 1-pchisq(2*8,df = 8)
[1] 0.04238011

In fact, the probability tends to zero with $K_2$:

[Figure: the null rejection probability $P(\chi^2_{K_2} > 2K_2)$ of the AIC rule plotted against $K_2$, decreasing towards zero.]
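A minimal R sketch reproducing this computation (the range of $K_2$ values is my own choice):

K2 <- 1:30
plot(K2, 1 - pchisq(2 * K2, df = K2), type = "b",
     xlab = expression(K[2]),
     ylab = "Null rejection probability of the AIC rule")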
