R's `glm` function fits a logistic regression when the response is dichotomous (yes/no, male/female, etc.) and the `family` parameter is passed the argument `binomial`. I'm wondering how to judge whether the model we built is good enough. As we know, in OLS regression, criteria like R^2 and adjusted R^2 tell us how much of the variation is explained, but there is no such measure for a GLM. Here is an example I ran:
```
> summary(fit.full)

Call:
glm(formula = ynaffair ~ gender + age + yearsmarried + children +
    religiousness + education + occupation + rating, family = binomial(),
    data = Affairs)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6575 -0.7459 -0.5714 -0.2552  2.5099

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.71792    0.96165   0.747 0.455336
gendermale     0.28665    0.23973   1.196 0.231811
age           -0.04494    0.01831  -2.454 0.014142 *
yearsmarried   0.09686    0.03236   2.993 0.002758 **
childrenyes    0.37088    0.29466   1.259 0.208147
religiousness -0.32230    0.09003  -3.580 0.000344 ***
education      0.01795    0.05088   0.353 0.724329
occupation     0.03210    0.07194   0.446 0.655444
rating2       -0.02312    0.58177  -0.040 0.968303
rating3       -0.84532    0.57619  -1.467 0.142354
rating4       -1.13916    0.55740  -2.044 0.040981 *
rating5       -1.61050    0.56649  -2.843 0.004470 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 608.22  on 589  degrees of freedom
AIC: 632.22
```
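One rough analogue of R^2 can already be computed from this output: McFadden's pseudo-R^2 is 1 minus the ratio of the residual deviance to the null deviance. A sketch, using the deviance values reported by the summary above:

```r
# McFadden's pseudo-R^2 from the deviances reported by summary(fit.full)
null_dev  <- 675.38   # null deviance (intercept-only model)
resid_dev <- 608.22   # residual deviance of the fitted model
pseudo_r2 <- 1 - resid_dev / null_dev
pseudo_r2  # roughly 0.099: the predictors explain little of the deviance
```

Unlike R^2 in OLS, pseudo-R^2 values have no "proportion of variance explained" interpretation and are mainly useful for comparing models on the same data.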
After removing the insignificant variables, the reduced model looks like the one below. Although the AIC decreased, we still do not know whether this is the lowest-AIC model we can achieve:
```
> summary(fit.reduced)

Call:
glm(formula = ynaffair ~ age + yearsmarried + religiousness +
    rating, family = binomial(), data = Affairs)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.5117 -0.7541 -0.5722 -0.2592  2.4123

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.10220    0.71849   1.534 0.125014
age           -0.03588    0.01740  -2.062 0.039224 *
yearsmarried   0.10113    0.02933   3.448 0.000565 ***
religiousness -0.32571    0.08971  -3.631 0.000282 ***
rating2        0.11848    0.57258   0.207 0.836068
rating3       -0.70168    0.56671  -1.238 0.215658
rating4       -0.96190    0.54230  -1.774 0.076109 .
rating5       -1.49502    0.55550  -2.691 0.007118 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 675.38  on 600  degrees of freedom
Residual deviance: 613.63  on 593  degrees of freedom
AIC: 629.63
```
We then perform an analysis of deviance, which suggests that the reduced model with four predictors fits as well as the full model:
```
> anova(fit.reduced, fit.full, test="Chisq")
Analysis of Deviance Table

Model 1: ynaffair ~ age + yearsmarried + religiousness + rating
Model 2: ynaffair ~ gender + age + yearsmarried + children +
    religiousness + education + occupation + rating
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       593     613.63
2       589     608.22  4   5.4124   0.2475
```
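The p-value in the last column can be reproduced by hand: the drop in deviance (613.63 − 608.22 ≈ 5.41) is compared to a chi-square distribution with 4 degrees of freedom (593 − 589). A quick check:

```r
# Likelihood ratio test by hand: deviance difference on df difference
dev_diff <- 613.63 - 608.22   # about 5.41
df_diff  <- 593 - 589         # 4
pchisq(dev_diff, df = df_diff, lower.tail = FALSE)  # about 0.248, as in the table
```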
Best Answer
If you want to find the best model for your data, one way to go is the function `dropterm()` from the package MASS. It automatically tests all models that differ from the current model by the dropping of one single term. This is done respecting marginality, so it does not try models in which a main effect is dropped while the same predictor is still present in an interaction (I think there is no good reason to fit such models anyway). You could start with a model in which all terms (main effects and interactions) are present and do a backward simplification: when you run `dropterm`, you can ask it to compare all possible reduced models with a likelihood ratio test, or to order them according to the AIC; you can then update your model by removing the superfluous predictors. You can repeat these steps until no more predictors can be removed without causing a significant drop in the goodness of fit of the model (according to either the AIC or the likelihood ratio test), indicating that you have found the best GLM for your data.
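A minimal, self-contained sketch of this loop, using a built-in dataset (`mtcars`) as a hypothetical stand-in for the Affairs model in the question:

```r
library(MASS)  # ships with R; provides dropterm()

# Illustrative logistic model on a built-in dataset (stand-in for fit.full)
fit <- glm(am ~ mpg + wt + hp, family = binomial(), data = mtcars)

# Compare all models that drop one single term: likelihood ratio test,
# rows ordered by AIC
dropterm(fit, test = "Chisq", sorted = TRUE)

# Remove the weakest term (here, say, hp) and repeat until every
# remaining term is significant or dropping it raises the AIC
fit2 <- update(fit, . ~ . - hp)
dropterm(fit2, test = "Chisq", sorted = TRUE)
```

`stepAIC()` from the same package automates this backward (or stepwise) search if you do not want to inspect each step yourself.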