I'm trying to understand the output of glm
when a categorical variable has more than 2 categories.
I'm analysing whether age affects death. Age is a categorical variable with 4 categories.
I use the following code in R:
mydata <- read.delim("Data.txt", header = TRUE)
mydata$Agecod <- factor(mydata$Agecod)
mylogit <- glm(Death ~ Agecod, data = mydata, family = "binomial")
summary(mylogit)
Obtaining the following output:
Call:
glm(formula = Death ~ Agecod, family = "binomial", data = mydata)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.4006 -0.8047 -0.8047  1.2435  2.0963

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.5108     0.7303   0.699   0.4843
Agecod2      -0.6650     0.7715  -0.862   0.3887
Agecod3      -1.4722     0.7658  -1.922   0.0546 .
Agecod4      -2.5903     1.0468  -2.474   0.0133 *
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 237.32 on 184 degrees of freedom
Residual deviance: 223.73 on 181 degrees of freedom
(1 observation deleted due to missingness)
AIC: 231.73
Number of Fisher Scoring iterations: 4
Since I have p-values for Agecod2, Agecod3 and Agecod4, and only Agecod4 has a significant p-value, my questions are:
- Is Age really associated with death?
- Is only the 4th age category associated with death?
- What happens with the first category, since I don't have a p-value for it?
Update:
Since Antoni Parellada says “It seems as though you have proven that old age is a good predictor of death” and Gung points out “You cannot tell from your output if Age is associated with death”, I’m still confused.
I understand that “Intercept” is representing Agecod1 and is the “reference level”. According to Gung “The Estimates for the rest are the differences between the indicated level and the reference level. The associated p-values are for the tests of the indicated level vs. the reference level in isolation.”
My question now is:
Since the Agecod4 p-value (0.0133) shows a significant difference from Agecod1 (the reference level), doesn’t that mean that age is associated with death?
I have also tried to perform a nested test with the following command:
anova(mylogit, test="LRT")
Obtaining:
        Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL                      184     237.32
Agecod   3   13.583       181     223.73 0.003531 *
Does it mean that Age is definitively associated with death?
Update2:
I have solved my problem using binary logistic regression in SPSS. The output is the same as “mylogit”, but with SPSS I obtain a global p-value for the overall variable Agecod, which is 0.008.
I don’t know if it is possible to obtain this “global p-value” with R, but since I know that I can use SPSS, it is not a big problem for me.
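(For reference, a comparable global p-value for the whole factor can be produced in R; this is a sketch reusing the mylogit model from above:)

```r
drop1(mylogit, test = "LRT")  # likelihood-ratio p-value for dropping Agecod as a whole
# car::Anova(mylogit)         # an alternative, if the car package is installed
```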
Best Answer
(Taking these out of order.)

Agecod1 is represented by the intercept. That is called the "reference level" of your factor variable. The Estimate for (Intercept) is the mean of the response for that level. The Estimates for the rest are the differences between the indicated level and the reference level. The associated p-values are for the tests of the indicated level vs. the reference level in isolation. They probably don't answer the question you actually have; they may best be ignored.

You cannot tell from your output whether Age is associated with death. You need to fit a nested model that does not have Age, but is otherwise identical. Then you can perform a nested model test (anova(nested_model, mylogit, test="LRT")).
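A minimal sketch of that comparison, reusing the variable names from the question (both models are fitted on the same complete cases so that anova() can compare them):

```r
# Keep only rows where both variables are observed, so the two fits use identical data
cc <- mydata[complete.cases(mydata[, c("Death", "Agecod")]), ]

full   <- glm(Death ~ Agecod, data = cc, family = "binomial")
nested <- glm(Death ~ 1,      data = cc, family = "binomial")  # identical, but without Age

anova(nested, full, test = "LRT")  # likelihood-ratio test of Agecod as a whole
```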
Updated to respond to additional information.
The anova() you ran tests Age as a whole. The p-value is listed as 0.003531, so that is significant unless your chosen alpha is less than that (which would be very unusual). Since your model is a logistic regression, there are several ways that such a test can be run, and therefore several p-values are possible. It is possible that SPSS is using a different method, and that both p-values are valid according to their own assumptions. To understand the different possibilities, it may help you to read my answer here: Why do my p-values differ between logistic regression output, chi-squared test, and the confidence interval for the OR?
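For instance, the likelihood-ratio and Wald versions of the whole-factor test can be compared directly in R (a sketch; assumes the car package is installed — SPSS's overall test for a categorical predictor is typically a Wald test, which may explain the 0.008 vs. 0.003531 difference):

```r
library(car)
Anova(mylogit, test.statistic = "LR")    # likelihood-ratio test, matching anova() above
Anova(mylogit, test.statistic = "Wald")  # Wald test of the whole Agecod factor
```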