Solved – Different AIC values for the same model using step()

aic, generalized linear model, model selection

I'm working with a GLM that has 152 predictor variables, and I'm trying to optimize the model. A LOT of these variables are not significant, so I'm trying to figure out which ones to remove using the step() function. However, I'm having trouble interpreting the AIC output, because the chosen models seem to have higher AICs than the original models in almost every case. Let's look at a garbage dataset I invented just as an example:

> x = c(1,2,3,4,5)
> x2 = c(234,2134,432,1,41)
> x3 = c(123,456,789,123,456)
> y = c(1234,4321,1234,3456,7654)
> bull = glm(y ~ x + x2 + x3,family=Gamma(link=log))
> step(bull,direction="backward")
Start:  AIC=89.13
y ~ x + x2 + x3

       Df Deviance    AIC
- x3    1  0.65042 **88.561**
<none>     0.27488 89.127
- x2    1  1.34701 91.222
- x     1  2.43317 95.370

Step:  AIC=91.5
y ~ x + x2


Call:  glm(formula = y ~ x + x2, family = Gamma(link = log))

Coefficients:
(Intercept)            x           x2  
  6.3315474    0.4664784    0.0004815  

Degrees of Freedom: 4 Total (i.e. Null);  2 Residual
Null Deviance:      2.434 
Residual Deviance: 0.6504       **AIC: 91.5**

Notice that the two starred AIC values are different, and the bottom one is higher than the original? Yet the model is chosen based on an AIC (also starred) that is lower. I'm wondering how to reconcile those two values and what exactly this procedure is doing.

Best Answer

The first starting AIC value is 89.13. This is for the (full) model, i.e. y ~ x + x2 + x3. Now I will explain the following table from the first step:

Df Deviance    AIC
- x3    1  0.65042 **88.561**
<none>     0.27488 89.127
- x2    1  1.34701 91.222
- x     1  2.43317 95.370

The AIC in the first row above means that if you drop x3 from y ~ x + x2 + x3, you will get an AIC value of 88.561. The 2nd row is just the AIC for the model y ~ x + x2 + x3 itself. The AIC in the 3rd row is for the case where you drop x2 from y ~ x + x2 + x3, and the last one is for when you drop x from y ~ x + x2 + x3. In other words, you drop each term one at a time and obtain the corresponding AIC values. Note that the AICs are sorted in ascending order. So what is the conclusion of the first step? Well, if we remove x3, the AIC is lower (i.e. a better model) compared to the other three cases, so x3 should be dropped from y ~ x + x2 + x3. That's why you see y ~ x + x2 in the 2nd step. The AIC for this model is 91.5. You don't see any term dropped in this step, because dropping either x or x2 does not improve the AIC, so we keep them in the model.
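You can see this concretely by refitting the reduced model yourself on the question's toy data and extracting its AIC directly; a minimal sketch using base R only:

```r
# Toy data from the question
x  <- c(1, 2, 3, 4, 5)
x2 <- c(234, 2134, 432, 1, 41)
x3 <- c(123, 456, 789, 123, 456)
y  <- c(1234, 4321, 1234, 3456, 7654)

full    <- glm(y ~ x + x2 + x3, family = Gamma(link = log))
reduced <- glm(y ~ x + x2,      family = Gamma(link = log))

# extractAIC() returns c(equivalent degrees of freedom, AIC).
# Refitting y ~ x + x2 reproduces the 91.5 shown at "Step: AIC=91.5",
# not the 88.561 listed next to "- x3" in the previous step's drop table.
extractAIC(reduced)
```

In other words, the AIC reported at the top of each step comes from the refitted model in that step, which is why it need not match the value shown in the drop table of the step before.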

By now you probably know the answer to your question about those two different AIC values. In backward elimination, the procedure only looks at the current model (in that step) and tries to improve it by dropping unnecessary variables; the algorithm does not look back at previous steps. Finally, note that AIC is a function of two things: 1) the number of parameters in the model and 2) the log-likelihood. So you may have a smaller number of parameters (as in the final model above), but once the penalty and the log-likelihood are combined, you can still end up with a higher AIC than for your full model.
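The definition AIC = 2k − 2·log L can be checked directly in R: logLik() returns the log-likelihood of a fit, with the number of estimated parameters stored in its "df" attribute. A minimal sketch on a toy lm fit:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(1234, 4321, 1234, 3456, 7654)

m  <- lm(y ~ x)
ll <- logLik(m)           # log-likelihood of the fitted model
k  <- attr(ll, "df")      # number of estimated parameters (incl. residual variance)

# AIC is the parameter penalty minus twice the log-likelihood
manual <- 2 * k - 2 * as.numeric(ll)
all.equal(manual, AIC(m))   # TRUE
```

This makes the trade-off explicit: removing a variable lowers k but also lowers log L, and the two effects can pull the AIC in opposite directions.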

I would like to add one last thing, about how the AICs are computed in the step function. This is explained in the help files here. There is some scaling involved. See, for example, the following code.

> #full model
> full = lm(y ~ x + x2 + x3)
> #null model
> null=lm(y ~1)
> 
> #Forward selection
> step(null,scope=list(lower=null, upper=full),direction="forward")
Start:  AIC=79.72
y ~ 1

       Df Sum of Sq      RSS    AIC
+ x     1  14340062 13829302 78.164
<none>              28169365 79.722
+ x2    1     10913 28158452 81.720
+ x3    1      5474 28163891 81.721

Step:  AIC=78.16
y ~ x

       Df Sum of Sq      RSS    AIC
<none>              13829302 78.164
+ x2    1   4027412  9801890 78.443
+ x3    1    426975 13402328 80.008

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
      -12.7       1197.5  

> 
> #Different AIC values
> AIC(lm(y ~1))
[1] 95.91092
> extractAIC (lm(y ~1),scale = 0)
[1]  1.00000 79.72154
> extractAIC (lm(y ~x+1),scale = 0)
[1]  2.00000 78.16431

As you can see in the code above, when I use the AIC function it gives 95.91092, which is not the same as the 79.722 reported in the first step of forward selection. But when I use the extractAIC function with scale = 0, it gives the identical result (79.72154). The same happens for lm(y ~ x + 1).
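For lm fits the two functions differ only by an additive constant. AIC() uses the full Gaussian log-likelihood, giving n·log(2π) + n·log(RSS/n) + n + 2(p + 1) (the +1 counts the residual variance), while extractAIC() reports just n·log(RSS/n) + 2p. The difference, n·log(2π) + n + 2, depends only on n, so both criteria rank models identically; a quick check against the values above (n = 5):

```r
n <- 5

# Constant offset between AIC() and extractAIC() for an lm fit on n observations
offset <- n * log(2 * pi) + n + 2
offset                 # 16.18939

# Difference between the two reported values for the null model
95.91092 - 79.72154    # 16.18938 -- matches up to rounding
```

Because the offset is the same for every model fitted to the same data, step() can safely use the extractAIC scale for its comparisons.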