Regression – Overall Significance of Categorical Variables in Logistic Regression

categorical-data, logistic, regression

I have seen two approaches to binary logistic regression with a categorical independent variable (IV) that has more than two levels. In the first, a reference category is defined for the IV and each remaining category is tested against it, giving one p-value per category relative to the reference (this is what I typically do). However, I have also seen logistic regression outputs that report an overall (or global) significance for a categorical IV, that is, a single p-value. I don't understand the second approach. I have read similar threads, but they do not resolve my specific questions:

  • What additional information does the second approach really provide? If the overall test is significant, wouldn't there be differences between some of the categories?
  • Does the second approach assume that the IV is continuous (providing an estimate by unit of change in X)?
  • Could it happen that there were differences between the categories of an IV, but the overall test was not significant?

Perhaps these are basic questions, but I would appreciate your help.

Best Answer

I think you're referring to a likelihood ratio test.

Could it happen that there were differences between the categories of an IV, but the overall test was not significant?

The null hypothesis of the LRT is that all coefficients for the categorical variable are 0, with the alternative being that at least one coefficient is not 0.
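Written out, the hypotheses look roughly as follows. The notation is an assumption made here for illustration and is not from the original question: a categorical variable with $k$ levels contributes $k-1$ dummy coefficients $\beta_2, \dots, \beta_k$ (level 1 being the reference), $\ell$ is the maximized log-likelihood and $D$ the residual deviance of the reduced model (without the variable) and the full model (with it).

```latex
\begin{aligned}
H_0 &: \beta_2 = \beta_3 = \dots = \beta_k = 0 \\
H_1 &: \beta_j \neq 0 \ \text{for at least one } j \in \{2, \dots, k\} \\[4pt]
G &= D_{\text{reduced}} - D_{\text{full}}
   = -2\,\bigl(\ell_{\text{reduced}} - \ell_{\text{full}}\bigr)
   \;\overset{\text{approx.}}{\sim}\; \chi^2_{k-1} \quad \text{under } H_0
\end{aligned}
```

The chi-square reference distribution is an asymptotic approximation, so the resulting p-value is approximate in small samples.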

Yes, that can happen: you can fail to reject the null of the LRT and still find a significant difference for an individual category. The two results are not mutually exclusive, because the individual tests and the overall test address different hypotheses.

What additional information does the second approach really provide? If the overall test is significant, wouldn't there be differences between some of the categories?

Looking at the p-values of the individual category coefficients does not tell us about the categorical variable as a whole, only about the statistical significance of each single coefficient relative to the reference category.

Here is an example in R:

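set.seed(0)
N = 100
# Five-level factor and one continuous predictor
cat = factor(sample(1:5, N, replace = TRUE))
x = rnorm(N)
# True linear predictor: intercept 1, slope 2 for x, all four cat coefficients 0
eta = model.matrix(~x+cat)%*%c(1,2,0,0,0,0)

# Inverse logit, then draw the binary outcome
p = 1/(1+exp(-eta))
y = rbinom(length(p),1,p)

# Fit the full model with both predictors
model = glm(y~x+cat, family = binomial())
summary(model)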



Call:
glm(formula = y ~ x + cat, family = binomial())

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0345  -0.7243   0.2921   0.6635   1.8355  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.6631     0.5648   1.174   0.2404    
x             1.9981     0.4647   4.299 1.71e-05 ***
cat2          0.8766     0.8555   1.025   0.3056    
cat3          0.3210     0.8327   0.386   0.6998    
cat4          1.1713     0.8468   1.383   0.1666    
cat5          1.8251     0.8722   2.093   0.0364 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 122.173  on 99  degrees of freedom
Residual deviance:  86.275  on 94  degrees of freedom
AIC: 98.275

Number of Fisher Scoring iterations: 5

By construction, the categories have no effect on the outcome (all four cat coefficients were set to 0 in the simulation), and yet cat5 comes out as significant. If we did not have access to the true data-generating mechanism, we might be tempted to say that the cat variable has an impact on the outcome.

But that would be erroneous, since we would be basing our decision on only one category of the variable. To determine whether a model with the cat variable does better than a model without it, we can do a likelihood ratio test.

model0 = glm(y~x, family = binomial())
anova(model0,model, test = 'LRT')

Analysis of Deviance Table

Model 1: y ~ x
Model 2: y ~ x + cat
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        98     92.167                     
2        94     86.275  4   5.8923   0.2073
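The p-value in this table can be reproduced by hand from the two residual deviances, which may help show what the overall test is doing. The sketch below copies the deviances and degrees of freedom from the output above; the variable names are made up for illustration. Note that the test uses 4 degrees of freedom, one per non-reference category, so the variable is being treated as categorical rather than as a single continuous term.

```r
# Values copied from the Analysis of Deviance Table above
dev_reduced <- 92.167  # residual deviance of y ~ x
dev_full    <- 86.275  # residual deviance of y ~ x + cat
df_reduced  <- 98
df_full     <- 94

# LRT statistic: drop in deviance when the cat dummies are added
lr_stat <- dev_reduced - dev_full
lr_df   <- df_reduced - df_full

# Upper-tail chi-square probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
round(c(statistic = lr_stat, df = lr_df, p = p_value), 4)
```

The statistic is about 5.892 on 4 degrees of freedom and the p-value is about 0.2073, matching the `anova()` output up to rounding.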

We fail to reject the null from this test. That means that, from our data, we cannot say that at least one of the coefficients of the cat variable is different from 0. And that would be correct. That the cat5 coefficient is significant is just an artifact of sampling and random error.
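One way to check the "artifact of sampling" explanation is to repeat the simulation many times under the same null setup and compare how often at least one individual category coefficient is significant with how often the LRT is. This is only a rough sketch: the number of repetitions, the seed, the 0.05 threshold and all the names (`one_run`, `grp`, `n_sims`) are assumptions chosen here, not part of the original answer.

```r
set.seed(1)
n_sims <- 200  # hypothetical number of repetitions
N <- 100

one_run <- function() {
  # Same setup as above: the five-level factor has no effect on the outcome
  grp <- factor(sample(1:5, N, replace = TRUE))
  x <- rnorm(N)
  y <- rbinom(N, 1, plogis(1 + 2 * x))

  full    <- glm(y ~ x + grp, family = binomial())
  reduced <- glm(y ~ x, family = binomial())

  # Wald p-values of the individual category coefficients
  coefs  <- coef(summary(full))
  wald_p <- coefs[grep("^grp", rownames(coefs)), "Pr(>|z|)"]

  # Overall LRT p-value for the factor
  lrt_p <- anova(reduced, full, test = "LRT")[2, "Pr(>Chi)"]

  c(any_wald = any(wald_p < 0.05), lrt = lrt_p < 0.05)
}

# Proportion of runs in which each criterion flags the factor
rejection_rates <- rowMeans(replicate(n_sims, one_run()))
rejection_rates
```

If the reasoning above holds, the first rate should come out above 0.05, since four coefficients each get their own chance at a false positive, while the LRT rate should stay close to the nominal 0.05.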