Solved – Interpreting dummy variables in glm

categorical-data, generalized-linear-model, r

I'm trying to understand the output of glm when a categorical variable has more than 2 categories.

I'm analysing whether age affects death. Age is a categorical variable with 4 categories.

I use the following code in R:

mydata <- read.delim("Data.txt", header = TRUE)                     # read the tab-delimited data
mydata$Agecod <- factor(mydata$Agecod)                              # treat the age code as a factor
mylogit <- glm(Death ~ Agecod, data = mydata, family = "binomial")  # logistic regression of Death on Age
summary(mylogit)

I obtain the following output:

Call:
glm(formula = Death ~ Agecod, family = "binomial", data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4006  -0.8047  -0.8047   1.2435   2.0963  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   0.5108     0.7303   0.699   0.4843  
Agecod2      -0.6650     0.7715  -0.862   0.3887  
Agecod3      -1.4722     0.7658  -1.922   0.0546 .
Agecod4      -2.5903     1.0468  -2.474   0.0133 *

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 237.32  on 184  degrees of freedom
Residual deviance: 223.73  on 181  degrees of freedom
  (1 observation deleted due to missingness)
AIC: 231.73

Number of Fisher Scoring iterations: 4

Since I have p-values for Agecod2, Agecod3 and Agecod4, and only Agecod4 has a significant p-value, my questions are:

  1. Is Age really associated with death?
  2. Is only the 4th age category associated with death?
  3. What happens with the first category, since I don't have a p-value for it?

Update:

Since Antoni Parellada says “It seems as though you have proven that old age is a good predictor of death” and Gung points out “You cannot tell from your output if Age is associated with death”, I’m still confused.

I understand that “(Intercept)” represents Agecod1 and is the “reference level”. According to Gung, “The Estimates for the rest are the differences between the indicated level and the reference level. The associated p-values are for the tests of the indicated level vs. the reference level in isolation.”
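To make that concrete for myself: the coefficients can be converted back to fitted probabilities with plogis(), the inverse of the logit link (a small sketch, assuming the mylogit fit above):

plogis(coef(mylogit)["(Intercept)"])                     # P(Death) for Agecod1, about 0.625
plogis(sum(coef(mylogit)[c("(Intercept)", "Agecod4")]))  # P(Death) for Agecod4, about 0.111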

My question now is:

Since the Agecod4 p-value (0.0133) indicates a significant difference from Agecod1 (the reference level), doesn’t that mean that age is associated with death?

I have also tried to perform a nested test with the following command:

anova(mylogit, test="LRT")

This gives:

       Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
NULL                     184     237.32            
Agecod  3   13.583       181     223.73 0.003531 *

Does it mean that Age is definitively associated with death?

Update2:

I have solved my problem using binary logistic regression in SPSS. The output is the same as that of “mylogit”, but with SPSS I obtain a global p-value for the overall variable Agecod, which is 0.008.

I don’t know if it is possible to obtain this “global p-value” with R, but since I know that I can use SPSS, it is not a big problem for me.
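That said, drop1() seems to give such an overall p-value directly in R; a minimal sketch, assuming the mylogit fit above:

drop1(mylogit, test = "LRT")   # one likelihood-ratio p-value for the whole Agecod factor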

Best Answer

(Taking these out of order.)

  1. The first category, Agecod1, is represented by the intercept. That is called the "reference level" of your factor variable. The Estimate for (Intercept) is the log-odds of the response for that level (i.e., the mean on the scale of the linear predictor). The Estimates for the rest are the differences between the indicated level and the reference level. The associated p-values are for the tests of the indicated level vs. the reference level in isolation. They probably don't answer the question you actually have; they may best be ignored.
  1. It makes no sense to say that "only the 4th age category is associated with death". No such thing is logically possible.
  1. You cannot tell from your summary output whether Age is associated with death. You need to fit a nested model that does not have Age, but is otherwise identical. Then you can perform a nested model test (anova(nested_model, mylogit, test="LRT")); see the sketch below.
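A minimal sketch of that comparison, assuming the mydata and mylogit objects from the question; model.frame() keeps exactly the rows mylogit used, so the observation deleted due to missingness is handled consistently:

used <- model.frame(mylogit)                                       # rows actually used by mylogit
nested_model <- glm(Death ~ 1, data = used, family = "binomial")   # same data, Age removed
anova(nested_model, mylogit, test = "LRT")                         # overall test for Age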

Updated to respond to additional information.

The anova() you ran tests Age as a whole. The p-value is listed as 0.003531, so that is significant unless your chosen alpha is less than that (which would be very unusual). Since your model is a logistic regression, there are several ways that such a test can be run, and therefore several p-values are possible. It is possible that SPSS is using a different method, and that both p-values are valid according to their own assumptions. To understand the different possibilities, it may help you to read my answer here: Why do my p-values differ between logistic regression output, chi-squared test, and the confidence interval for the OR?
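For instance, here is a sketch of the same overall test computed two ways; it assumes the car package is installed and uses the mylogit fit from the question:

library(car)
Anova(mylogit, test.statistic = "LR")    # likelihood-ratio version; matches anova(mylogit, test = "LRT")
Anova(mylogit, test.statistic = "Wald")  # Wald version; typically what SPSS's logistic output reports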
