I have seen two approaches in binary logistic regression with categorical independent variables (IVs) that have more than two levels. In one approach, a reference category for the IV is defined and each remaining category is tested against it, yielding a p-value for each category relative to the reference (this is what I typically do). However, I have also seen logistic regression output that reports an overall (or global) significance for a categorical IV, that is, a single p-value. I don't understand the second approach. I have read similar threads, but they do not resolve my specific questions:
- What additional information does the second approach really provide? If there is overall significance, would there not be differences between some of the categories?
- Does the second approach assume that the IV is continuous (providing an estimate by unit of change in X)?
- Could it happen that there are differences between the categories of an IV, but the overall test is not significant?
Perhaps they are basic questions, but I would appreciate your help.
Best Answer
I think you're referring to a likelihood ratio test.
The null hypothesis of the LRT is that all coefficients for the categorical variable are 0, with the alternative being that at least one coefficient is not 0.
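Written out, and assuming the variable has $k$ levels coded as $k-1$ dummy coefficients against a reference level (the symbols $\beta_j$, $\ell_0$, $\ell_1$ and $\Lambda$ are notation introduced here, not taken from any particular software output), the test looks roughly like this:

$$
H_0:\ \beta_2 = \beta_3 = \dots = \beta_k = 0
\qquad\text{vs.}\qquad
H_1:\ \beta_j \neq 0 \text{ for at least one } j
$$

$$
\Lambda = -2\,\bigl(\ell_0 - \ell_1\bigr) \;\overset{\text{approx.}}{\sim}\; \chi^2_{k-1} \quad \text{under } H_0
$$

Here $\ell_0$ is the maximized log-likelihood of the model without the categorical variable and $\ell_1$ that of the model with it. The result is a single p-value for the variable as a whole, and nothing in it treats the variable as continuous.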
It is possible to fail to reject the null of the LRT and yet find a significant difference between individual categories. Those two things aren't mutually exclusive.
Evaluating a categorical variable by looking at the p-values of its individual coefficients does not tell us about the variable as a whole, only about each single coefficient, that is, one category compared with the reference.
Here is an example in R
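A minimal sketch of such a simulation, not necessarily the exact code behind the results discussed here. The seed, the sample size, the five levels and the names `d`, `fit` and `wald` are all assumptions made for illustration. The outcome is generated independently of `cat`, so every true coefficient is 0.

```r
# Assumed setup: a 5-level factor with NO true effect on a binary outcome.
set.seed(1)   # arbitrary seed (assumption)
n <- 500      # arbitrary sample size (assumption)

d <- data.frame(
  cat = factor(sample(1:5, n, replace = TRUE)),  # levels "1".."5"
  y   = rbinom(n, size = 1, prob = 0.5)          # generated independently of cat
)

# Dummy-coded logistic regression; level 1 is the reference category.
fit <- glm(y ~ cat, data = d, family = binomial)

# One Wald z-test per non-reference level: rows cat2, cat3, cat4, cat5.
wald <- coef(summary(fit))
print(wald)
```

Which level, if any, falls below 0.05 depends on the seed. With four comparisons each tested at the 5% level, one of them crossing the threshold by chance alone is not unusual.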
A priori, we know that the categories have no effect on the outcome, and yet `cat5` comes out as significant. So if we did not have access to the true data-generating mechanism, we might be tempted to say that the `cat` variable has an impact on the outcome. But that would be erroneous, since we would be basing our decision on only one category of the variable. To determine whether a model with the `cat` variable does better than a model without it, we can do a likelihood ratio test.
We fail to reject the null from this test. That means that, from our data, we cannot say that at least one of the coefficients of the `cat` variable is nonzero. And that would be correct. That `cat5` is significant is just an artifact of sampling and random error.
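The likelihood ratio test described above can be sketched as follows. This is an illustration under assumed settings (the seed, the sample size and the names `d`, `full`, `reduced` and `lrt` are hypothetical), so the p-value it prints will not necessarily match the one discussed in this answer.

```r
# Assumed setup: same kind of null simulation, 5-level factor, no true effect.
set.seed(1)   # arbitrary seed (assumption)
n <- 500      # arbitrary sample size (assumption)

d <- data.frame(
  cat = factor(sample(1:5, n, replace = TRUE)),
  y   = rbinom(n, size = 1, prob = 0.5)
)

full    <- glm(y ~ cat, data = d, family = binomial)  # with the factor
reduced <- glm(y ~ 1,   data = d, family = binomial)  # intercept only

# Likelihood ratio test: a single p-value for the factor as a whole.
lrt <- anova(reduced, full, test = "LRT")
print(lrt)

# The same test by hand: the drop in deviance is compared to a chi-squared
# distribution with (number of levels - 1) degrees of freedom.
stat <- deviance(reduced) - deviance(full)
df   <- df.residual(reduced) - df.residual(full)
p    <- pchisq(stat, df = df, lower.tail = FALSE)
```

The single p-value from `anova()` answers "does the factor as a whole improve the fit?", whereas the per-level p-values in `summary()` each answer "does this level differ from the reference?".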