I'm working with a GLM, trying to optimize the model, and there are 152 predictor variables. A LOT of these are not significant, so I'm trying to figure out which ones to remove using the step() function. However, I'm having trouble interpreting the AIC output, because the chosen models seem to have higher AICs than the original models in almost every case. Let's take a look at a garbage dataset I invented just for this example:
```
> x = c(1,2,3,4,5)
> x2 = c(234,2134,432,1,41)
> x3 = c(123,456,789,123,456)
> y = c(1234,4321,1234,3456,7654)
> bull = glm(y ~ x + x2 + x3,family=Gamma(link=log))
> step(bull,direction="backward")
Start:  AIC=89.13
y ~ x + x2 + x3

       Df Deviance    AIC
- x3    1  0.65042 **88.561**
<none>      0.27488  89.127
- x2    1  1.34701  91.222
- x     1  2.43317  95.370

Step:  AIC=91.5
y ~ x + x2

Call:  glm(formula = y ~ x + x2, family = Gamma(link = log))

Coefficients:
(Intercept)            x           x2
  6.3315474    0.4664784    0.0004815

Degrees of Freedom: 4 Total (i.e. Null);  2 Residual
Null Deviance:      2.434
Residual Deviance: 0.6504    **AIC: 91.5**
```
Notice that the starred AIC values are different, and the bottom one is higher than the original? Yet the model is chosen based on an AIC (also starred) that is lower. I'm wondering how to reconcile those two values and what exactly this procedure is doing.
Best Answer
The starting AIC value, 89.13, is for the full model, i.e. `y ~ x + x2 + x3`. Now consider the table printed in the first step. The first row means that if you drop `x3` from `y ~ x + x2 + x3`, you get an AIC of 88.561. The second row (`<none>`) is just the AIC of `y ~ x + x2 + x3` itself. The third row covers the case where you drop `x2`, and the last row the case where you drop `x`. In other words, each term is dropped one at a time and the corresponding AIC is computed; note that the rows are sorted in ascending order of AIC.

So what is the conclusion of the first step? If we remove `x3`, the AIC is lower (i.e. a better model) than in the other three cases, so `x3` should be dropped from `y ~ x + x2 + x3`. That's why you see `y ~ x + x2` in the second step.
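You can reproduce that first table directly with base R's `drop1()`, which is the function `step()` calls internally to build it. A minimal sketch on the `bull` model from the question (`drop1` lists the rows in formula order rather than sorted by AIC):

```r
# Single-term deletions from the full model: each row refits the model
# with that one term removed and reports the resulting deviance and AIC.
drop1(bull)
#> Single term deletions
#>
#> Model:
#> y ~ x + x2 + x3
#>        Df Deviance    AIC
#> <none>     0.27488 89.127
#> x       1  2.43317 95.370
#> x2      1  1.34701 91.222
#> x3      1  0.65042 88.561
```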
The AIC reported for this reduced model is 91.5. You don't see any term being dropped in this step, because dropping either `x` or `x2` would not improve the AIC, so both are kept in the model.

By now you can probably answer your own question about the two different AIC values: in backward elimination the procedure only looks at the current model in each step and tries to improve it by dropping unnecessary variables; the algorithm does not look back at previous steps. Finally, note that AIC is a function of two things: (1) the number of parameters in the model, and (2) the log-likelihood. So you may have fewer parameters (as in the final model above), yet once the penalty and the log-likelihood are combined, end up with a higher AIC than the full model's.
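To make those two ingredients concrete, here is a small sketch that rebuilds the likelihood-based AIC by hand for the question's full and reduced models, using the standard formula AIC = 2*df - 2*logLik:

```r
full    <- glm(y ~ x + x2 + x3, family = Gamma(link = log))
reduced <- glm(y ~ x + x2,      family = Gamma(link = log))

# AIC = 2 * (number of estimated parameters) - 2 * log-likelihood
manual_aic <- function(fit) {
  ll <- logLik(fit)
  2 * attr(ll, "df") - 2 * as.numeric(ll)
}

manual_aic(full)     # agrees with AIC(full), 89.13
manual_aic(reduced)  # agrees with AIC(reduced), 91.5: higher than the full
                     # model's AIC despite having fewer parameters
```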
I would like to add one last thing, on how the AICs are computed inside the `step` function. This is explained in the help files (see `?extractAIC`); there is some scaling involved. See, for example, the following code.
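A minimal sketch, assuming the forward selection is started from the intercept-only model `lm(y ~ 1)` fitted to the question's data; this reproduces both of the values quoted next:

```r
null <- lm(y ~ 1)   # intercept-only model, the starting point for forward selection
# step(null, scope = ~ x + x2 + x3, direction = "forward")
# would print "Start:  AIC=79.72" for this model, and yet:

AIC(null)                    # 95.91092 -- the usual log-likelihood-based AIC
extractAIC(null, scale = 0)  # edf = 1, AIC = 79.72154 -- what step() reports
```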
As you can see, the `AIC` function gives 95.91092, which is not the same as the 79.722 reported in the first step of the forward selection. But the `extractAIC` function with `scale = 0` gives the identical result (79.72154). The same happens for `lm(y ~ x + 1)`.
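For `lm` fits, the two definitions differ only by an additive constant that does not depend on which predictors are in the model, so they rank candidate models identically; a quick check on this example:

```r
# For an lm fit, AIC(fit) - extractAIC(fit, scale = 0)[2] equals
# n*log(2*pi) + n + 2, a constant across models of the same data.
n <- 5
n * log(2 * pi) + n + 2   # 16.18939 = 95.91092 - 79.72154
```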