Model Selection – Why AIC and P-Value Variable Selection Can’t Be Used Together in Model Building

aic · model-selection · p-value

In our assignment we were asked to model the compressive strength of concrete (response variable: Strength) with the predictor variables Cement (kg/m^3), Water (kg/m^3), and Coarse.Aggregate (kg/m^3). We had to use backward stepwise model selection, starting from the complete second-order model, to produce the final model.

#complete second order model
full.model <- lm(Strength ~ (Cement + Water + Coarse.Aggregate)^2 +
                I(Cement^2) + I(Water^2) + I(Coarse.Aggregate^2),
                data = concrete.df)

The statistical analysis was carried out in R (via RStudio); below is the output from the final step of the stepwise model selection.

# backwards stepwise model selection, keeping the selected model
final.model <- step(full.model, direction = "backward")

# output of the final step
Step:  AIC=519.54
Strength ~ Cement + Water + Coarse.Aggregate + I(Water^2) + Water:Coarse.Aggregate

                         Df Sum of Sq   RSS    AIC
<none>                                16004 519.54
- I(Water^2)              1    691.74 16695 521.77
- Water:Coarse.Aggregate  1   2490.36 18494 532.00
- Cement                  1   3105.32 19109 535.27

I attempted to determine whether this model was the best fit for the data by looking at the ANOVA table:

# anova table
anova(final.model)

# output 
Analysis of Variance Table

Response: Strength
                       Df  Sum Sq Mean Sq F value    Pr(>F)    
Cement                  1  2612.8 2612.80 15.3468 0.0001696 ***
Water                   1  2126.3 2126.27 12.4890 0.0006370 ***
Coarse.Aggregate        1   945.3  945.27  5.5522 0.0205336 *  
I(Water^2)              1    34.3   34.26  0.2012 0.6547702    
Water:Coarse.Aggregate  1  2490.4 2490.36 14.6276 0.0002355 ***
Residuals              94 16003.6  170.25                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Based on the ANOVA table I concluded that the I(Water^2) term was not significant, since its p-value was greater than the 0.05 threshold, and removed that term from the final model.

# final model
final.model <- lm(Strength ~ Cement + Water + Coarse.Aggregate +
                      Coarse.Aggregate:Water, data = concrete.df)

When the assignment was marked, the feedback mentioned that the ANOVA was unnecessary and that I shouldn't combine AIC-based and p-value-based variable selection within the same model-building exercise.

My question is: why is this the case? I have tried to find resources explaining the reasoning behind this, with no luck.

Note: this is a second year statistics course and it may be outside of the scope of what we are learning right now but I need to know! Thanks!

Sorry if this is not the right forum to ask, let me know if there is a better place to ask this question.

Best Answer

AIC is a model selection criterion: it compares models directly, without favouring some null hypothesis.

Contrary to this, with a hypothesis test (and with p-values) you choose to reject or not reject a null hypothesis based on whether some (subjective) significance criterion is met. The hypothesis test generally favours the null hypothesis: with a strict significance level you can end up not rejecting the null hypothesis, and not selecting the more complex model, even when the more complex model improves the likelihood by a lot.

Selection based on AIC is very much related to Wilks' hypothesis test (which approximates the distribution of the likelihood ratio statistic by a chi-squared distribution). When you compare a smaller model (model 0, with likelihood $L_0$ and $AIC_0$) with a fuller model that has one parameter more (model 1, with likelihood $L_1$ and $AIC_1$), the difference in AIC is related to the log-likelihood ratio:

$$\lambda_{LR} = - 2 \log \left(\frac{L_0}{L_1}\right) = -2\left( \log L_0 - \log L_1 \right) = AIC_0 - AIC_1 + 2 \sim \chi^2_1 $$

Here the $\sim$ means asymptotically equal in distribution (more often it means 'is distributed as').

For instance, comparing the full model (AIC = 519.54) with the model without the Water:Coarse.Aggregate interaction (AIC = 532.00) gives a log-likelihood ratio of $532.00 - 519.54 + 2 = 14.46$, which corresponds to a p-value of about 0.00014. This is quite similar to the p-value of 0.0002355 in your ANOVA table, whose last row tests exactly this drop-one comparison (with an exact F-test instead of the chi-squared approximation).

When you select the model with the lowest AIC, you are effectively using a cut-off of 2 for the log-likelihood ratio (when the models differ by one parameter), which corresponds to a p-value of 0.157, much less strict than typical significance levels such as 0.05, 0.01 or smaller.
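As a check, the chi-squared p-values implied by the AIC differences can be computed directly; a sketch, with the AIC values copied from the step() output above:

```r
# Chi-squared p-values implied by the AIC differences in the step() table.
# Dropping one parameter: lambda = (AIC_without_term - AIC_full) + 2.
aic_full <- 519.54
aic_drop <- c("I(Water^2)"             = 521.77,
              "Water:Coarse.Aggregate" = 532.00,
              "Cement"                 = 535.27)

lambda <- aic_drop - aic_full + 2
pchisq(lambda, df = 1, lower.tail = FALSE)

# The implicit threshold used by step(): a term is kept only if dropping it
# raises the AIC, i.e. lambda > 2, which corresponds to p < 0.157.
pchisq(2, df = 1, lower.tail = FALSE)
```

Note that the implied p-value for I(Water^2) is about 0.04: by this drop-one comparison the term would pass even a 0.05 threshold, quite unlike the 0.65 in the sequential ANOVA table.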


The reason you should not mix AIC-based and F-test-based selection is that they follow different philosophies.

The hypothesis test is subjective, and the philosophy behind a hypothesis test is different from that of model selection. With hypothesis testing you generally place a stronger focus/preference on the model with fewer variables: the smaller model is the null hypothesis, and rejecting the null hypothesis requires strong evidence.

This subjectivity is good practice when your goal is hypothesis testing, for instance when you are testing some scientific theory and want no doubt about the selected parameters in the model. But for model selection, when there is no obvious reason to favour a null hypothesis and a model with fewer parameters, hypothesis testing makes less sense.

Note that, technically speaking, what you did is not wrong. In particular, you can do stepwise model selection based on an F-test (ANOVA) instead of AIC, and this is done a lot. It is just a bit superfluous to mix them, first selecting based on AIC and then adding a final step of selection based on an F-test. You could have used F-test-based selection from the start.
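In base R, drop1() gives the drop-one F-tests you would use for such F-test-based backward elimination. A sketch using the built-in swiss data, since the concrete data from the question is not available here:

```r
# Backward elimination driven by F-tests: drop1() compares the current
# model against each model with one term removed, using an F-test.
m <- lm(Fertility ~ ., data = swiss)
drop1(m, test = "F")

# Refit without the least significant term and repeat until every
# remaining term passes the chosen threshold.
m <- update(m, . ~ . - Examination)  # Examination has the largest p-value
drop1(m, test = "F")
```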

The function stepwise from the package StepReg lets you do such stepwise regression based on repeated ANOVA tests, and it has parameters to change the default p-value of 0.157 (which roughly corresponds to AIC-based selection) to some other level.

Example code:

Note that the p-value for Infant.Mortality is 0.00169; with a selection criterion of 0.0017 the variable is selected, while with 0.0016 it is not.

> library(StepReg)
> summary(lm(Fertility ~ Education + Catholic + Infant.Mortality, data = swiss))

Call:
lm(formula = Fertility ~ Education + Catholic + Infant.Mortality, 
    data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.4781  -5.4403  -0.5143   4.1568  15.1187 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      48.67707    7.91908   6.147 2.24e-07 ***
Education        -0.75925    0.11680  -6.501 6.83e-08 ***
Catholic          0.09607    0.02722   3.530  0.00101 ** 
Infant.Mortality  1.29615    0.38699   3.349  0.00169 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.505 on 43 degrees of freedom
Multiple R-squared:  0.6625,    Adjusted R-squared:  0.639 
F-statistic: 28.14 on 3 and 43 DF,  p-value: 3.15e-10

> stepwise(Fertility ~., data = swiss , sle = 0.0016, select = "SL")$Process
  Step EnteredEffect RemovedEffect DF NumberEffectIn NumberParmsIn                   SL
1    0             1                1              0             1                    1
2    1     Education                1              1             2 3.65861696596238e-07
3    2      Catholic                1              2             3 0.000559833157909886
> stepwise(Fertility ~., data = swiss , sle = 0.0017, select = "SL")$Process
  Step    EnteredEffect RemovedEffect DF NumberEffectIn NumberParmsIn                   SL
1    0                1                1              0             1                    1
2    1        Education                1              1             2 3.65861696596238e-07
3    2         Catholic                1              2             3 0.000559833157909886
4    3 Infant.Mortality                1              3             4  0.00169375295474758

Sidenote. The reason you ended up removing the I(Water^2) term is slightly different. The ANOVA you ran is sequential: each term is tested given only the terms listed above it. So the p-value of 0.65 for I(Water^2) compares models that do not yet contain the Water:Coarse.Aggregate interaction; it is not the test for dropping I(Water^2) from the final model (which, as the step() table shows, would raise the AIC).
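This order dependence is easy to demonstrate; a sketch with the built-in swiss data, since the concrete data is not available here:

```r
# In a sequential (Type I) ANOVA each term is tested given only the terms
# listed before it, so the p-values change when the formula is reordered.
anova(lm(Fertility ~ Education + Catholic + Infant.Mortality, data = swiss))
anova(lm(Fertility ~ Infant.Mortality + Catholic + Education, data = swiss))

# drop1() instead tests each term given all the others; only the last row
# of the sequential ANOVA agrees with it.
drop1(lm(Fertility ~ Education + Catholic + Infant.Mortality, data = swiss),
      test = "F")
```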