Generalized Linear Model – Significance of P-Value in Model with Lowest AIC

aic, generalized linear model, modeling

I am fitting a GLM in R on a dataset with a large number of variables (around 25). I have checked for collinearity, and there is quite a lot of it within some groups of variables, so I have run the analysis both with all variables and with a selected subset.

I want to have an explanatory model for an outcome, possibly might progress to have a predictive model later on.

I also have a significant portion of NA's, so although I have 200 rows, I only get 90 observations.

Below is the model with the lowest AIC, chosen with stepAIC. I chose a Poisson GLM because the outcome is bounded between 0 and 96. A Gamma GLM was not possible because the outcome contains a number of 0's.

MODEL INFO:
Observations: 90
Dependent Variable: postWOMACpain_6
Type: Linear regression 

MODEL FIT:
χ²(8) = 166.64, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.29
Pseudo-R² (McFadden) = 0.07
AIC = 411.52, BIC = 436.51 

Standard errors: MLE
---------------------------------------------------
                        Est.   S.E.   t val.      p
-------------------- ------- ------ -------- ------
(Intercept)            -1.66   1.07    -1.55   0.12
anaesthesia.FG          0.94   0.53     1.78   0.08
preWOMACpain            0.26   0.07     3.57   0.00
rs6746030TRUE          -0.83   0.56    -1.48   0.14
rs11898284TRUE          1.33   0.57     2.34   0.02
rs533586TRUE           -2.44   0.70    -3.48   0.00
rs2075572TRUE           1.04   0.64     1.63   0.11
rs609148TRUE            1.77   0.68     2.61   0.01
rs6985606TRUE           0.81   0.55     1.48   0.14
---------------------------------------------------

Estimated dispersion parameter = 5.04

My query would be:

If the AIC increases when I remove every variable with a p-value > 0.05 in a stepwise manner, then I will end up with only a couple of factors.

For instance, removing anaesthesia.FG, rs6746030, rs6985606 yields:

MODEL INFO:
Observations: 90
Dependent Variable: postWOMACpain_6
Type: Linear regression 

MODEL FIT:
χ²(5) = 138.56, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.24
Pseudo-R² (McFadden) = 0.06
AIC = 411.50, BIC = 429.00 

Standard errors: MLE
---------------------------------------------------
                        Est.   S.E.   t val.      p
-------------------- ------- ------ -------- ------
(Intercept)            -0.99   0.96    -1.03   0.31
preWOMACpain            0.26   0.07     3.52   0.00
rs11898284TRUE          1.54   0.55     2.79   0.01
rs533586TRUE           -2.04   0.69    -2.97   0.00
rs2075572TRUE           1.10   0.65     1.71   0.09
rs609148TRUE            1.21   0.64     1.90   0.06
---------------------------------------------------

Estimated dispersion parameter = 5.2

down to:

MODEL INFO:
Observations: 90
Dependent Variable: postWOMACpain_6
Type: Linear regression 

MODEL FIT:
χ²(2) = 90.72, p = 0.00
Pseudo-R² (Cragg-Uhler) = 0.16
Pseudo-R² (McFadden) = 0.04
AIC = 414.86, BIC = 424.86 

Standard errors: MLE
---------------------------------------------------
                        Est.   S.E.   t val.      p
-------------------- ------- ------ -------- ------
(Intercept)            -0.32   0.73    -0.44   0.66
preWOMACpain            0.19   0.07     2.81   0.01
rs11898284TRUE          1.62   0.57     2.84   0.01
---------------------------------------------------

Estimated dispersion parameter = 5.57

which has a higher AIC, but all of its variables are significant.

Your help is much appreciated. None of the courses I have taken seem to address this issue, so I am not sure how to tackle it properly.

Thanks

Best Answer

I'm not sure what the actual question is here. There seems to be some concern that p-values are "not significant" for the model with the lowest AIC. That is certainly possible, especially when you have missing data.
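One way to see why this is possible: AIC charges a penalty of 2 per parameter, so removing a single coefficient lowers AIC only when its likelihood-ratio statistic is below 2. The sketch below (base R only, assuming a plain maximum-likelihood fit with no estimated dispersion) converts that cutoff into the p-value it implies, which is considerably more lenient than 0.05.

```r
# AIC = -2 * logLik + 2 * k, so dropping one parameter reduces AIC only if
# the likelihood-ratio statistic for that parameter is smaller than 2.
lr_cutoff <- 2

# p-value implied by that cutoff for a 1-df chi-squared test
p_cutoff <- pchisq(lr_cutoff, df = 1, lower.tail = FALSE)

# Equivalent threshold on the absolute z statistic
z_cutoff <- sqrt(lr_cutoff)

round(c(p = p_cutoff, z = z_cutoff), 3)
```

This gives a p-value threshold of about 0.157 and a |z| threshold of about 1.41. That is roughly in line with the stepAIC output in the question, where every retained term has an absolute test statistic of at least 1.48 even though several p-values exceed 0.05.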

However, please try not to be too concerned with statistical significance. Since you are interested primarily in inference:

"I want to have an explanatory model for an outcome, possibly might progress to have a predictive model later on."

...then it is crucially important that you select the variables for your model in a principled way. Any stepwise procedure is a bad approach to this. More generally, any approach based solely on p-values is bad. The model has absolutely no idea which of your variables are the main exposure, competing exposures, potential confounders, colliders or mediators, and so it cannot tell you anything about how these variables are related causally. Moreover, when you adjust for mediators and colliders, and when you over-adjust for confounders, you can introduce severe bias in the estimates that are of primary concern. See the accepted answer here for further details on principled variable selection for inference:
How do DAGs help to reduce bias in causal inference?
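As a rough illustration of what selecting variables from a causal diagram can look like, here is a minimal sketch that assumes the dagitty package is installed. The graph is entirely hypothetical (E is an exposure, Y the outcome, C a confounder, M a mediator) and is not derived from the data in the question.

```r
library(dagitty)

# Hypothetical DAG: C confounds the E -> Y relationship,
# and M lies on the causal path from E to Y.
g <- dagitty("dag {
  C -> E
  C -> Y
  E -> M -> Y
}")

# Minimal sufficient adjustment sets for the total effect of E on Y
sets <- adjustmentSets(g, exposure = "E", outcome = "Y")
print(sets)
```

For this toy graph the adjustment set contains the confounder C but not the mediator M, which is the kind of distinction a stepwise procedure has no way of making.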

Related Solutions

Solved – Negative binomial GLM, the most complex model always has lowest AIC (all interaction terms)

Here are some options and things to consider:

First, AIC is a somewhat naive variable selection method. The fact that your model is not improved (in terms of AIC) by removing any of the variables suggests that maybe you don't want to remove anything.

Second, consider the purpose of this kind of model selection. You only have three inputs, so even with interaction terms you're not significantly improving the computational efficiency of your model by dropping terms. Moreover, it looks like you haven't evaluated whether or not you're overfitting at all, which I would suggest is the main benefit of seeking parsimonious models. You can certainly use BIC to try to find a sparser model, but you should compare that model against the full model using some other criterion, like cross-validation or a hold-out test set.
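A cross-validated comparison of that kind could be sketched as follows. This uses base R only, simulated data, and a Poisson family for simplicity rather than the negative binomial model from the question; the data frame, the variable names and the helper `cv_deviance` are all hypothetical.

```r
set.seed(1)
# Simulated stand-in data; all names here are hypothetical
n <- 200
df <- data.frame(Cov1 = rnorm(n), Cov2 = rnorm(n), Cov3 = rnorm(n))
df$responsevar <- rpois(n, lambda = exp(0.3 + 0.5 * df$Cov1))

# Assign each row to one of k folds once, so both models see the same splits
k <- 5
folds <- sample(rep(seq_len(k), length.out = n))

# Mean out-of-fold Poisson deviance for a model formula (lower is better)
cv_deviance <- function(formula) {
  fold_dev <- sapply(seq_len(k), function(i) {
    fit  <- glm(formula, family = poisson, data = df[folds != i, ])
    test <- df[folds == i, ]
    mu   <- predict(fit, newdata = test, type = "response")
    y    <- test$responsevar
    # Unit deviance; the y * log(y / mu) term is taken as 0 when y == 0
    mean(2 * (ifelse(y > 0, y * log(y / mu), 0) - (y - mu)))
  })
  mean(fold_dev)
}

cv_full   <- cv_deviance(responsevar ~ Cov1 * Cov2 * Cov3)
cv_sparse <- cv_deviance(responsevar ~ Cov1)
c(full = cv_full, sparse = cv_sparse)
```

If the sparser model's out-of-fold deviance is no worse than the full model's, that is some evidence the extra terms are not helping prediction.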

Third, this is the model recommended via backwards elimination. You can try building up your model via forward selection instead and see if you don't get a different (perhaps better) result. Something like this:

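# Intercept-only and full models. No family is given, so glm() fits a
# Gaussian model; set family= (or use MASS::glm.nb) to match your response.
null.model <- glm(responsevar ~ 1, data = df)
full.model <- glm(responsevar ~ Cov1 * Cov2 * Cov3, data = df)
# Forward selection by AIC, starting from the null model
step(null.model,
     scope = list(lower = formula(null.model), upper = formula(full.model)),
     direction = "forward")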

Keep in mind: stepwise regression is a greedy algorithm and is prone to getting stuck in local optima.

Finally, stepwise regression isn't your only option for model selection. You could use PCA to explore which variables carry most of the variation, or use regularization techniques such as ridge regression (L2 regularization, which shrinks coefficients) or lasso regression (L1 regularization, which performs both variable selection and coefficient shrinkage).

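library(glmnet)
# x: numeric predictor matrix (e.g. from model.matrix()), y: response vector
lasso.model = glmnet(x, y)             # alpha = 1 (the default) gives the lasso
ridge.model = glmnet(x, y, alpha = 0)  # alpha = 0 gives ridge regression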
These are of course just a handful of the various available options to you. It's entirely possible that the model you already have is perfectly suitable. Is there any particular reason you suspect that it's not?

Solved – auto.arima cannot offer best model lowest aic

As noted in the comments, you cannot compare AIC values between models with different orders of differencing.

For that reason, the order of differencing is not chosen by AIC in auto.arima. Instead, unit root tests are used.

Even after the differencing is selected, the model returned will not necessarily have the minimum AIC because various other checks are done to ensure the model is well-behaved and numerically stable. For example, the model you fit is returning NaN values for some standard errors -- a sign of numerical instability in the likelihood. Such a model would never be returned by auto.arima. It also avoids models that have roots close to the unit circle.
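To make the two stages concrete, here is a small sketch using the forecast package on a simulated random walk (the series is an assumption, not the asker's data). It first picks the order of differencing with a unit root test and then searches over ARMA orders with that differencing held fixed.

```r
library(forecast)

set.seed(1)
# Simulated random walk as a stand-in series
y <- ts(cumsum(rnorm(200)))

# Order of differencing chosen by a unit root test (KPSS by default), not by AIC
d <- ndiffs(y)

# Search over ARMA orders with d fixed; stepwise = FALSE and
# approximation = FALSE request a slower but more thorough search
fit <- auto.arima(y, d = d, stepwise = FALSE, approximation = FALSE)
fit
```

Even with the more thorough search, the returned model can still differ from the global minimum-AIC model because of the stability checks described above.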
