I am having trouble interpreting the z values for categorical variables in logistic regression. In the example below I have a categorical variable with 3 classes and according to the z value, CLASS2 might be relevant while the others are not.
But now what does this mean?
That I could merge the other classes into one?
That the whole variable might not be a good predictor?
This is just an example and the z values here are not from a real problem; I just have difficulty interpreting them.
Estimate Std. Error z value Pr(>|z|)
CLASS0 6.069e-02 1.564e-01 0.388 0.6979
CLASS1 1.734e-01 2.630e-01 0.659 0.5098
CLASS2 1.597e+00 6.354e-01 2.514 0.0119 *
Best Answer
The following explanation is not limited to logistic regression but applies equally in normal linear regression and other GLMs. Usually, R excludes one level of the categorical variable and the coefficients denote the difference of each class to this reference class (sometimes also called the baseline class); this is called dummy coding or treatment contrasts in R (see here for an excellent overview of the different contrast options). To see the current contrasts in R, type options("contrasts"). Normally, R orders the levels of the categorical variable alphabetically and takes the first as the reference class. This is not always optimal and can be changed by typing new.variable <- relevel(old.variable, ref="c") (here, we would set the reference class to "c" in the new variable).
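For instance, a quick sketch (old.variable here is just a placeholder factor for illustration):

    # Show the contrast coding currently in effect
    options("contrasts")

    # A placeholder factor: R orders the levels alphabetically, so "a"
    # is the reference class by default
    old.variable <- factor(c("a", "b", "c", "b", "a", "c"))

    # Make "c" the reference class instead
    new.variable <- relevel(old.variable, ref = "c")
    levels(new.variable)  # "c" is now listed first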
For each coefficient of every level of the categorical variable, a Wald test is performed to test whether the pairwise difference between the coefficient of the reference class and the other class differs from zero. This is what the $z$- and $p$-values in the regression table are. If only one categorical class is significant, this does not imply that the whole variable is meaningless and should be removed from the model. You can check the overall effect of the variable by performing a likelihood ratio test: fit two models, one with and one without the variable, and type anova(model1, model2, test="LRT") in R (see the example below). Here is an example:
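A minimal sketch of such a fit, assuming the well-known UCLA graduate-admissions data (binary response admit, numeric predictors gre and gpa, and a four-level factor rank; adjust the URL if the file has moved):

    # Load the admissions data and treat rank as a categorical variable
    mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
    mydata$rank <- factor(mydata$rank)

    # Logistic regression; under treatment contrasts, rank1 is the
    # omitted reference class
    my.mod <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
    summary(my.mod)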
The level rank1 has been omitted and each coefficient of rank denotes the difference between the coefficient of rank1 and the corresponding rank level. So the difference between the coefficients of rank1 and rank2 would be $-0.675$. The coefficient of rank1 is simply the intercept, so the true coefficient of rank2 would be $-3.99 - 0.675 = -4.67$. The Wald tests here check whether the difference between the coefficient of the reference class (here rank1) and the corresponding level differs from zero. In this case, we have evidence that the coefficients of all classes differ from the coefficient of rank1. You could also fit the model without an intercept by adding - 1 to the model formula to see all coefficients directly:
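Continuing the sketch (the model name here is arbitrary):

    # The same model without an intercept: every rank level now gets its
    # own coefficient instead of a difference to rank1
    my.mod.nointer <- glm(admit ~ gre + gpa + rank - 1,
                          data = mydata, family = "binomial")
    summary(my.mod.nointer)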
Note that the intercept is gone now and that the coefficient of rank1 is exactly the intercept of the first model. Here, the Wald test checks not the pairwise difference between coefficients but the hypothesis that each individual coefficient is zero. Again, we have evidence that every coefficient of rank differs from zero. Finally, to check whether the whole variable rank improves the model fit, we fit one model with the variable rank (my.mod1) and one without it (my.mod2) and conduct a likelihood ratio test. This tests the hypothesis that all coefficients of rank are zero:
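Again as a sketch on the assumed admissions data:

    # Full model with rank, reduced model without it
    my.mod1 <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
    my.mod2 <- glm(admit ~ gre + gpa, data = mydata, family = "binomial")

    # Likelihood ratio test of H0: all rank coefficients are zero
    anova(my.mod1, my.mod2, test = "LRT")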
The likelihood ratio test is highly significant and we would conclude that the variable rank should remain in the model. This post is also very interesting.