Regression – Overall Significance of Categorical Variables in Logistic Regression

categorical data, logistic, regression

I have seen two approaches to binary logistic regression with categorical independent variables (IVs) that have more than two levels. In the first approach, a reference category is defined for the IV and each remaining category is tested against it, which yields one p-value per category relative to the reference (this is what I typically do). However, I have also seen logistic regression outputs that report an overall (or global) significance for a categorical IV, that is, a single p-value for the whole variable. I don't understand this second approach. I have read similar threads, but they do not resolve my specific questions:

What additional information does the second approach really provide? If there is overall significance, would there not be differences between some of the categories?
Does the second approach assume that the IV is continuous (providing an estimate by unit of change in X)?
Could it happen that there were differences between the categories of an IV, but the overall test was not significant?

Perhaps they are basic questions, but I would appreciate your help.

Best Answer

I think you're referring to a likelihood ratio test.

Could it happen that there were differences between the categories of an IV, but the overall test was not significant?

The null hypothesis of the LRT is that all coefficients for the categorical variable are 0, with the alternative being that at least one coefficient is not 0.
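
Written out, as a sketch that assumes a factor with $k$ levels coded as $k-1$ dummy variables with coefficients $\beta_2, \dots, \beta_k$ (notation introduced here for illustration), the test compares the residual deviance $D_0$ of the model without the factor to the residual deviance $D_1$ of the model with it:

$$H_0: \beta_2 = \beta_3 = \dots = \beta_k = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0 \text{ for at least one } j$$

$$G = D_0 - D_1 \;\overset{\text{approx.}}{\sim}\; \chi^2_{k-1} \quad \text{under } H_0$$

The test has $k-1$ degrees of freedom, one per dummy variable, so it does not treat the IV as continuous.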

Yes, that can happen: you can fail to reject the null of the LRT and still find a significant difference between individual categories. Those two things aren't mutually exclusive.

What additional information does the second approach really provide? If there is overall significance, would there not be differences between some of the categories?

Looking at the p-values of the individual category coefficients does not tell us about the categorical variable as a whole, only about the statistical significance of each single coefficient, that is, of one category compared with the reference category.

Here is an example in R

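set.seed(0)
N <- 100
# Five-level factor; note that the name `cat` shadows the name of base::cat()
cat <- factor(sample(1:5, N, replace = TRUE))
x <- rnorm(N)
# True linear predictor: intercept 1, slope 2 for x, and 0 for every cat dummy
eta <- model.matrix(~ x + cat) %*% c(1, 2, 0, 0, 0, 0)

p <- 1 / (1 + exp(-eta))        # inverse logit
y <- rbinom(length(p), 1, p)    # simulated binary outcome

# Dummy-coded fit: level 1 of cat is the reference category
model <- glm(y ~ x + cat, family = binomial())
summary(model)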



Call:
glm(formula = y ~ x + cat, family = binomial())

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0345  -0.7243   0.2921   0.6635   1.8355  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.6631     0.5648   1.174   0.2404    
x             1.9981     0.4647   4.299 1.71e-05 ***
cat2          0.8766     0.8555   1.025   0.3056    
cat3          0.3210     0.8327   0.386   0.6998    
cat4          1.1713     0.8468   1.383   0.1666    
cat5          1.8251     0.8722   2.093   0.0364 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 122.173  on 99  degrees of freedom
Residual deviance:  86.275  on 94  degrees of freedom
AIC: 98.275

Number of Fisher Scoring iterations: 5

By construction, we know that the categories have no effect on the outcome (their coefficients in the simulation are all 0), and yet cat5 comes out as significant. So if we did not have access to the true data-generating mechanism, we might be tempted to say that the cat variable has an impact on the outcome.

But that would be erroneous, since we would be basing our decision on only one category of the variable. To determine whether a model with the cat variable does better than a model without it, we can do a likelihood ratio test.

# Reduced model without the factor, then the LRT against the full model
model0 <- glm(y ~ x, family = binomial())
anova(model0, model, test = "LRT")

Analysis of Deviance Table

Model 1: y ~ x
Model 2: y ~ x + cat
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        98     92.167                     
2        94     86.275  4   5.8923   0.2073
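
The p-value in this table can be reproduced by hand, which makes the mechanics of the test explicit. The following is a minimal sketch that copies the deviances and residual degrees of freedom from the table; the variable names are chosen here for illustration.

```r
# Deviances and residual d.f. copied from the analysis-of-deviance table
dev_reduced <- 92.167   # model 1: y ~ x
dev_full    <- 86.275   # model 2: y ~ x + cat
df_reduced  <- 98
df_full     <- 94

lr_stat <- dev_reduced - dev_full   # likelihood ratio statistic (drop in deviance)
lr_df   <- df_reduced - df_full     # 4 = number of cat dummies (5 levels - 1)

# Upper-tail chi-squared probability: the "Pr(>Chi)" column of the table
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p = p_value))
```

The test uses 4 degrees of freedom, one for each non-reference category, and yields a single p-value for the cat variable as a whole.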

We fail to reject the null from this test. That means that, from our data, we cannot say that at least one of the coefficients of the cat variable is different from 0. And that would be correct. That the cat5 coefficient is significant is just an artifact of sampling and random error.
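
How often such an artifact shows up can be explored with a small simulation. The sketch below is not part of the original analysis: it reuses the same data-generating mechanism (no effect of the factor), and the names `one_run`, `grp`, `n_sim`, `rate_any_wald` and `rate_lrt` are hypothetical. For each simulated data set it records whether any single category coefficient has p < 0.05 and whether the overall LRT has p < 0.05.

```r
set.seed(123)
n_sim <- 1000   # number of simulated data sets (arbitrary choice)
N <- 100

one_run <- function() {
  grp <- factor(sample(1:5, N, replace = TRUE))   # factor with no true effect
  x   <- rnorm(N)
  y   <- rbinom(N, 1, plogis(1 + 2 * x))          # outcome depends on x only

  full    <- glm(y ~ x + grp, family = binomial())
  reduced <- glm(y ~ x, family = binomial())

  # Wald p-values of the four grp dummies (drop intercept and x rows)
  wald_p <- coef(summary(full))[-(1:2), "Pr(>|z|)"]
  # Overall likelihood ratio test for grp
  lrt_p  <- anova(reduced, full, test = "LRT")[2, "Pr(>Chi)"]

  c(any_wald = any(wald_p < 0.05), lrt = lrt_p < 0.05)
}

res <- replicate(n_sim, one_run())
rate_any_wald <- mean(res["any_wald", ])   # at least one dummy "significant"
rate_lrt      <- mean(res["lrt", ])        # overall test "significant"
print(c(any_wald = rate_any_wald, lrt = rate_lrt))
```

Under these assumptions one would expect the overall test to reject at a rate close to the nominal 5%, while "at least one significant category" occurs noticeably more often, because four coefficients are each given a chance to cross the threshold.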

Related Solutions

Solved – Overall significance test for the effect of an independent continuous variable on a categorical dependent variable

It's very important to first define the nature of your dependent variable. If it is qualitative and ordinal, then an ordinal probit (or logit) model is the right choice. With this model you will have a single slope parameter per explanatory variable regardless of the category, as only the constant changes across categories. If your dependent variable is social status, then it can easily be considered ordinal. Thus, inference on the effect of an independent variable becomes straightforward.

Solved – Interpretation of logistic regression intercept with one dummy-coded categorical variable

I think you are making this hard on yourself. Make sure race is a factor variable so that the software provides the overall $\chi^2$ of association with $k-1$ d.f. for $k$ categories. Coding doesn't affect the value of $\chi^2$. Don't use a stepwise process for making inference about the importance of race. Use the overall "chunk" test as described above, which has a built-in perfect multiplicity adjustment besides being invariant to coding. In R this would look like (for a binary or ordinal logistic model predicting $Y$):

require(rms)
f <- lrm(Y ~ rcs(age, 4) + race)
anova(f)   # 3 d.f. test for age, k-1 for race
# also prints 2 d.f. test of linearity in age
# age fit is restricted cubic spline with 4 default knots

When doing multiple imputation with the Hmisc package aregImpute function or with the mice package, you would substitute the following for the 2nd line above:

f <- fit.mult.impute(Y ~ rcs(age, 4) + race, lrm, impute_object, n.impute=20)

which would adjust the covariance matrix for multiple imputation [n.impute recommended to be the percent of observations that have any variable missing].
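
The statement above that coding doesn't affect the overall $\chi^2$ can be checked with a small sketch. It uses base R `glm` and `anova` rather than rms, and simulated data in which `race`, `age` and `y` are hypothetical stand-ins: changing the reference category changes the individual coefficients and their p-values, but not the overall test for the factor.

```r
set.seed(1)
n    <- 200
race <- factor(sample(c("a", "b", "c", "d"), n, replace = TRUE))  # hypothetical 4-level factor
age  <- rnorm(n, mean = 50, sd = 10)
y    <- rbinom(n, 1, plogis(-2.5 + 0.05 * age))

fit_reduced <- glm(y ~ age, family = binomial())

# Same factor, two different reference categories
fit_a  <- glm(y ~ age + race, family = binomial())       # reference "a"
race_c <- relevel(race, ref = "c")
fit_c  <- glm(y ~ age + race_c, family = binomial())     # reference "c"

lrt_a <- anova(fit_reduced, fit_a, test = "LRT")
lrt_c <- anova(fit_reduced, fit_c, test = "LRT")

# Individual contrasts differ between the two codings ...
print(coef(summary(fit_a)))
print(coef(summary(fit_c)))
# ... but the overall k - 1 = 3 d.f. test is the same
print(lrt_a)
print(lrt_c)
```

Both fits span the same model space, so the deviance drop and its degrees of freedom agree up to numerical tolerance.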
