Generalized Linear Model – Significance of P-Value in Model with Lowest AIC

aic, generalized linear model, modeling

I am running a GLM in R on a dataset with a large number of variables (around 25). I have checked for collinearity, and there is quite a lot of it within some groups of variables, so I have run the models both with all variables and with a selected subset.

I want to have an explanatory model for an outcome, possibly might progress to have a predictive model later on.

I also have a substantial proportion of NAs, so although I have 200 rows, only 90 observations are used in the fit.

Below is the model with the lowest AIC, chosen with stepAIC. I chose a Poisson GLM because the outcome is bounded between 0 and 96. A Gamma GLM was not possible because the outcome contains a number of zeros.

MODEL INFO:
Observations: 90
Dependent Variable: postWOMACpain_6
Type: Linear regression 

MODEL FIT:
χ²(8) = 166.64, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.29
Pseudo-R² (McFadden) = 0.07
AIC = 411.52, BIC = 436.51 

Standard errors: MLE
---------------------------------------------------
                        Est.   S.E.   t val.      p
-------------------- ------- ------ -------- ------
(Intercept)            -1.66   1.07    -1.55   0.12
anaesthesia.FG          0.94   0.53     1.78   0.08
preWOMACpain            0.26   0.07     3.57   0.00
rs6746030TRUE          -0.83   0.56    -1.48   0.14
rs11898284TRUE          1.33   0.57     2.34   0.02
rs533586TRUE           -2.44   0.70    -3.48   0.00
rs2075572TRUE           1.04   0.64     1.63   0.11
rs609148TRUE            1.77   0.68     2.61   0.01
rs6985606TRUE           0.81   0.55     1.48   0.14
---------------------------------------------------

Estimated dispersion parameter = 5.04 

My query would be:

If I remove every variable with a p-value > 0.05 in a stepwise manner, the AIC increases and I end up with only a couple of factors.

For instance, removing anaesthesia.FG, rs6746030 and rs6985606 yields:

MODEL INFO:
Observations: 90
Dependent Variable: postWOMACpain_6
Type: Linear regression 

MODEL FIT:
χ²(5) = 138.56, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.24
Pseudo-R² (McFadden) = 0.06
AIC = 411.50, BIC = 429.00 

Standard errors: MLE
---------------------------------------------------
                        Est.   S.E.   t val.      p
-------------------- ------- ------ -------- ------
(Intercept)            -0.99   0.96    -1.03   0.31
preWOMACpain            0.26   0.07     3.52   0.00
rs11898284TRUE          1.54   0.55     2.79   0.01
rs533586TRUE           -2.04   0.69    -2.97   0.00
rs2075572TRUE           1.10   0.65     1.71   0.09
rs609148TRUE            1.21   0.64     1.90   0.06
---------------------------------------------------

Estimated dispersion parameter = 5.2 

down to:

MODEL INFO:
Observations: 90
Dependent Variable: postWOMACpain_6
Type: Linear regression 

MODEL FIT:
χ²(2) = 90.72, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.16
Pseudo-R² (McFadden) = 0.04
AIC = 414.86, BIC = 424.86 

Standard errors: MLE
---------------------------------------------------
                        Est.   S.E.   t val.      p
-------------------- ------- ------ -------- ------
(Intercept)            -0.32   0.73    -0.44   0.66
preWOMACpain            0.19   0.07     2.81   0.01
rs11898284TRUE          1.62   0.57     2.84   0.01
---------------------------------------------------

Estimated dispersion parameter = 5.57 

which has a higher AIC, but all of its variables are significant.
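
For reference, the AIC differences between the three fits can be tabulated with a short sketch (Python, standard library only). The values are copied from the three summaries above, which all report 90 observations; the labels are illustrative and simply count the non-intercept terms.

```python
# AIC values copied from the three model summaries above.
aic = {
    "8 predictors": 411.52,
    "5 predictors": 411.50,
    "2 predictors": 414.86,
}

# Difference of each model's AIC from the smallest AIC in the set.
best = min(aic.values())
delta = {name: round(value - best, 2) for name, value in aic.items()}

for name, d in delta.items():
    print(f"{name}: delta AIC = {d:.2f}")
```

The 8- and 5-predictor models are essentially tied (a difference of 0.02), while the 2-predictor model is about 3.4 higher. A common rule of thumb treats differences below roughly 2 as negligible, although that is a convention and not a formal test.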

Your help is much appreciated. None of the courses I have taken seem to address this issue, so I am not sure how to tackle it properly.

Thanks

Best Answer

I'm not sure what the actual question is here. There seems to be some concern that p-values are "not significant" for the model with the lowest AIC. That is certainly possible, especially when you have missing data.
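
One way to see why this is possible: AIC = 2k - 2 log L, so dropping a single parameter lowers the AIC only when the likelihood-ratio statistic for that parameter is below 2. For a 1-df test this corresponds to roughly |z| < 1.41, which is a far more lenient cut-off than p < 0.05. The sketch below (Python, standard library only) computes the implied threshold. It relies on the asymptotic normal approximation, so it is a rough guide and not a reproduction of what stepAIC computes for this particular model.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# AIC = 2k - 2*logLik. Dropping one parameter changes the AIC by
# (LR statistic - 2), so AIC keeps the term whenever LR > 2.
lr_threshold = 2.0
z_threshold = math.sqrt(lr_threshold)  # a 1-df chi-squared statistic is z squared
p_threshold = two_sided_p(z_threshold)

print(f"|z| threshold implied by AIC: {z_threshold:.3f}")   # 1.414
print(f"p-value threshold implied by AIC: {p_threshold:.3f}")  # 0.157
```

In other words, AIC-based selection is expected to retain terms with p-values up to about 0.16, which is consistent with the smallest |t| in the first table being 1.48.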

However, please try not to be too concerned with statistical significance. Since you are interested primarily in inference:

"I want to have an explanatory model for an outcome, possibly might progress to have a predictive model later on."

...then it is crucially important that you select the variables for your model in a principled way. Any stepwise procedure is a bad approach to this. More generally, any approach based solely on p-values is bad. The model has absolutely no idea which of your variables are the main exposure, competing exposures, potential confounders, colliders or mediators, and so it cannot tell you anything about how these variables are related causally. Moreover, when you adjust for mediators or colliders, or over-adjust for confounders, you can introduce severe bias in the estimates that are of primary concern. See the accepted answer to the following question for further details on principled variable selection for inference:
How do DAGs help to reduce bias in causal inference?
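
To illustrate the point about colliders, here is a minimal simulation sketch (Python, standard library only). Everything in it is made up for illustration: the variable names, the sample size and the effect sizes are assumptions and have nothing to do with the dataset in the question. The exposure and outcome are generated independently, and the collider is caused by both.

```python
import random

def ols_slopes(y, x1, x2=None):
    """Least-squares slope(s) of y on x1 (and optionally x2), with an intercept."""
    n = len(y)

    def centred(v):
        m = sum(v) / n
        return [a - m for a in v]

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    y, x1 = centred(y), centred(x1)
    if x2 is None:
        return (dot(x1, y) / dot(x1, x1),)
    x2 = centred(x2)
    # Solve the 2x2 normal equations by Cramer's rule.
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    s1y, s2y = dot(x1, y), dot(x2, y)
    det = s11 * s22 - s12 * s12
    return ((s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det)

random.seed(1)
n = 5000
exposure = [random.gauss(0, 1) for _ in range(n)]
outcome = [random.gauss(0, 1) for _ in range(n)]  # truly unrelated to exposure
collider = [e + o + random.gauss(0, 1) for e, o in zip(exposure, outcome)]

(unadjusted,) = ols_slopes(outcome, exposure)
adjusted, _ = ols_slopes(outcome, exposure, collider)

print(f"exposure slope, unadjusted:            {unadjusted:+.2f}")
print(f"exposure slope, adjusted for collider: {adjusted:+.2f}")
```

Under this setup the unadjusted slope should be close to zero, which is the truth, while the slope adjusted for the collider should be close to -0.5. A selection procedure driven by AIC or p-values would happily keep the collider, because it is strongly associated with the outcome.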
