I am having trouble interpreting the z values for categorical variables in logistic regression. In the example below I have a categorical variable with 3 classes and according to the z value, CLASS2 might be relevant while the others are not.

But now what does this mean?

That I could merge the other classes to one?

That the whole variable might not be a good predictor?

This is just an example and the z values here are not from a real problem; I just have difficulty interpreting them.

```
        Estimate Std. Error z value Pr(>|z|)
CLASS0 6.069e-02  1.564e-01   0.388   0.6979
CLASS1 1.734e-01  2.630e-01   0.659   0.5098
CLASS2 1.597e+00  6.354e-01   2.514   0.0119 *
```

## Best Answer

The following explanation is not limited to logistic regression but applies equally to normal linear regression and other GLMs. Usually, `R` excludes one level of the categorical variable, and the coefficients denote the difference of each class from this reference class (also sometimes called the baseline class). This is called dummy coding, or treatment contrasts in `R` (see here for an excellent overview of the different contrast options). To see the current contrasts in `R`, type `options("contrasts")`. Normally, `R` orders the levels of the categorical variable alphabetically and takes the first as the reference class. This is not always optimal and can be changed by typing `new.variable <- relevel(old.variable, ref="c")` (here, we would set the reference class to "c" in the new variable).

For each coefficient of every level of the categorical variable, a Wald test is performed to test whether the pairwise difference between the coefficient of the reference class and the other class differs from zero. This is what the $z$- and $p$-values in the regression table are. If only one categorical class is significant, this does *not* imply that the whole variable is meaningless and should be removed from the model. You can check the overall effect of the variable by performing a likelihood ratio test: fit two models, one with and one without the variable, and type `anova(model1, model2, test="LRT")` in `R` (see example below).

Here is an example: the level `rank1` has been omitted, and each coefficient of `rank` denotes the difference between the coefficient of `rank1` and the corresponding `rank` level. So the difference between the coefficient of `rank1` and `rank2` would be $-0.675$. The coefficient of `rank1` is simply the intercept, so the true coefficient of `rank2` would be $-3.99 - 0.675 = -4.67$. The Wald tests here test whether the difference between the coefficient of the reference class (here `rank1`) and the corresponding levels differs from zero. In this case, we have evidence that the coefficients of all classes differ from the coefficient of `rank1`.

You could also fit the model without an intercept by adding `- 1` to the model formula to see all coefficients directly. Note that the intercept is gone then and that the coefficient of `rank1` is exactly the intercept of the first model. Here, the Wald test checks not the pairwise difference between coefficients but the hypothesis that each individual coefficient is zero. Again, we have evidence that every coefficient of `rank` differs from zero.

Finally, to check whether the whole variable `rank` improves the model fit, fit one model with the variable `rank` (`my.mod1`) and one without it (`my.mod2`) and conduct a likelihood ratio test. This tests the hypothesis that all coefficients of `rank` are zero. The likelihood ratio test is highly significant, and we would conclude that the variable `rank` should remain in the model.

This post is also very interesting.
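The steps above can be sketched in self-contained `R` code. Note that the simulated dataset below is only a stand-in (not the data behind the quoted output), so the numeric estimates will differ from the $-0.675$ and $-3.99$ cited above; only the names `my.mod1` and `my.mod2` come from the text, everything else is illustrative.

```r
## Minimal sketch of the three steps above, using simulated data as a
## stand-in for the original dataset.
set.seed(42)
n     <- 400
rank  <- factor(sample(paste0("rank", 1:4), n, replace = TRUE))
eta   <- c(-0.5, -1.2, -1.8, -2.2)[as.integer(rank)]  # level-specific log-odds
admit <- rbinom(n, 1, plogis(eta))
d     <- data.frame(admit = admit, rank = rank)

options("contrasts")  # treatment contrasts by default: first level is the reference

## Step 1: dummy coding. rank1 is absorbed into the intercept; each
## rankX coefficient is the difference of that level from rank1.
my.mod1 <- glm(admit ~ rank, data = d, family = binomial)
summary(my.mod1)

## Step 2: drop the intercept to see one coefficient per level directly.
mod.noint <- glm(admit ~ rank - 1, data = d, family = binomial)
## The rank1 coefficient now equals the intercept of my.mod1:
all.equal(unname(coef(mod.noint)["rankrank1"]),
          unname(coef(my.mod1)["(Intercept)"]))

## Step 3: likelihood ratio test for the whole variable rank.
my.mod2 <- glm(admit ~ 1, data = d, family = binomial)  # model without rank
anova(my.mod2, my.mod1, test = "LRT")
```

Whatever data you substitute, the structural facts hold: the `rank1` coefficient of the intercept-free model equals the intercept of `my.mod1`, and `anova(my.mod2, my.mod1, test = "LRT")` tests all `rank` coefficients jointly.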