Solved – Interpret logistic regression output with multiple categorical & continuous variables

categorical-data, interpretation, logistic, logit

I'm trying to analyze data from a survey that asked employers how difficult it is to hire for certain occupations (80 occupations in total), but I'm uncertain about the interpretation of my results.

Dependent Variable: 0 = not difficult, 1 = difficult.

Independent Variables:

  1. occupation group (occ.group, 9 groups coded 1 to 9); the occupations were grouped to cut down on the total number of levels
  2. area (0 = urban, 1 = rural)
  3. hourly wage of the occupation (hr.wage continuous, numeric).

The data is structured as such:

   empnum   response  soc  occ.group  area  hr.wage
1  123450          1    1          1     1    70.20
2  543210          0    1          1     0    50.10
3  111111          0    1          1     1    71.10
4  222222          1    2          2     1    60.23
5  333333          1    2          4     0   100.57
6  4444444         0   80          9     1    60.18

Model is as follows:

logit1 <- glm(response~occ.group+area+hr.wage, family=binomial, data=health.sub)

Coefficients:
                Estimate Std. Error  z value  Pr(>|z|)   
(Intercept)     0.184978   0.466055    0.397   0.69144   
occ.group2     -0.243524   0.528233   -0.461   0.64479   
occ.group3      0.281285   0.407879    0.690   0.49043   
occ.group4     -0.063578   0.510229   -0.125   0.90084   
occ.group5     -0.039797   0.403032   -0.099   0.92134   
occ.group6      0.419655   0.475109    0.883   0.37708   
occ.group7     -0.898869   0.652530   -1.378   0.16835   
occ.group8    -15.015216 394.6834     -0.038   0.96965   
occ.group9     -0.370350   0.405532   -0.913   0.36111   
area1           0.008863   0.129376    0.069   0.94538   
hr.wage         0.010475   0.004028    2.600   0.00931   **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1679.4  on 1259  degrees of freedom
Residual deviance: 1638.5  on 1249  degrees of freedom
AIC: 1660.5

Number of Fisher Scoring iterations: 13

I'm trying to get an idea if my interpretations are correct.

  1. Because occ.group2 coeff is negative it would have lower odds of being difficult to hire than occ.group1 (reference group)

    odds are exp(-0.2353) = .79
    probability = .4414

  2. An employer with occupation in occ.group2 and area1: -0.243524 + 0.008863 = -0.234661, and exp(-0.234661) = 0.79 odds of difficult. Would that be compared to occ.group1 (reference group) in the same area1, or would it be area0? I guess I'm having a hard time interpreting the reference group with two categorical variables.

  3. Employer with occupation in occ.group2 and area0? Would I add the coefficient for occ.group2 (-0.243524) and the intercept (0.184978)? That doesn't seem right, but I'm not sure how to deal with multiple reference groups. What about an employer with occ.group1 and area0?

  4. Would the odds of an employer with occupation in group2 and area1 with a 1 dollar increase in hr.wage be exp(-0.243524 + 0.008863 + 0.010475) = exp(-0.224186) = .799, compared to an employer with occupation group1 in area1 with a 1 dollar hr.wage increase?


Thanks for the explanations and clarification @EdM.

I also had a question about deviation/effects coding of the occ.group categorical predictor, because I'm not convinced that comparisons between the occ.group levels are meaningful given their major differences. I've read that effects coding makes it possible to compare each group to the overall mean response. But again, I'm not sure I fully understand how to interpret the results.

I've used the contr.sum and contrasts commands in R to apply effects coding to the occ.group variable and end up with this output.

contr.sum(5)
contrasts(health.sub$occ.group) = contr.sum(5)
Call:
glm(formula = response ~ occ.group + area + hr.wage, family = binomial, 
data = health.sub)

Deviance Residuals: 
Min       1Q   Median       3Q      Max  
-1.7637  -1.2172   0.8009   1.0462   1.1513  
Coefficients:
#UPDATED deviation/effects coded model based on @EdM's comment.
#`occ.group8` dropped due to limited responses
#Other `occ.group`'s were reorganized to more appropriate categories 1-5 
             Estimate   Std. Error z value  Pr(>|z|)    
(Intercept)   0.546408   0.147042   3.716   0.000202 ***
occ.group1   -0.265066   0.139886  -1.895   0.058109 .  
occ.group2   -0.507674   0.110791  -4.582   0.0000046 ***
occ.group3    0.186679   0.154514   1.208   0.226981    
occ.group4    0.347765   0.128618   2.704   0.006854 ** 
area1         0.003353   0.129615   0.026   0.979365    
hr.wage       0.002139   0.003228   0.663   0.507424    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1679.4  on 1259  degrees of freedom
Residual deviance: 1635.1  on 1253  degrees of freedom
AIC: 1649.1
Number of Fisher Scoring iterations: 4

#UPDATED values to reflect @EdM's response.

How would I go about interpreting the odds for an employer with occ.group2 and area0 with an hr.wage = 50.00 given the effects coding?

I believe the odds would be calculated as exp(0.5464 + (-0.5077) + (0.5464 - 0.0034) + (0.0021*50.00)) = exp(0.6937) = 2.001, which is in comparison to the mean of the mean occ.group logits rather than any specific reference group.
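As a cross-check on the hand calculation above, here is a minimal sketch of how the linear predictor is usually assembled under sum-to-zero coding. It assumes that only occ.group was given contr.sum contrasts and that area kept R's default treatment coding (the summary still reports a single area1 coefficient), so area0 contributes nothing. The names `b_eff`, `dev_group5` and `eta` are introduced here for illustration only.

```r
# Coefficients copied from the effects-coded model summary above
b_eff <- c(intercept  = 0.546408,
           occ.group1 = -0.265066, occ.group2 = -0.507674,
           occ.group3 = 0.186679,  occ.group4 = 0.347765,
           area1      = 0.003353,  hr.wage    = 0.002139)

# Under contr.sum the group deviations sum to zero, so the unlisted fifth
# group's deviation is minus the sum of the four reported ones
dev_group5 <- -sum(b_eff[paste0("occ.group", 1:4)])

# Assumption: area is still treatment coded, so area0 multiplies area1 by 0
area <- 0
wage <- 50

# Log-odds for occ.group2, area0, hr.wage = 50: the intercept enters once
eta <- unname(b_eff["intercept"] + b_eff["occ.group2"] +
              b_eff["area1"] * area + b_eff["hr.wage"] * wage)

print(c(log_odds = eta, odds = exp(eta), probability = plogis(eta)))
print(c(occ.group5_deviation = dev_group5))
```

Under these assumptions the intercept is the unweighted mean of the occ.group log-odds at area0 and hr.wage = 0, and each occ.group coefficient is that group's deviation from it, so the result differs from a sum that counts the intercept twice.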

Best Answer

Your understanding seems generally correct. The intercept in this and in other standard R regression summaries represents the case for the reference levels of all categorical variables (false for logical) and for a 0 value of all continuous variables.
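As a rough illustration of what the intercept encodes, the sketch below converts the intercept of the first model in the question into odds and a probability for the all-reference case (occ.group1, area0, hr.wage = 0). The variable names are chosen here for illustration.

```r
# Intercept copied from the first (treatment-coded) model summary
b0 <- 0.184978

# Log-odds -> odds -> probability for occ.group1, area0, hr.wage = 0
odds_ref <- exp(b0)
prob_ref <- plogis(b0)  # equivalent to odds_ref / (1 + odds_ref)

print(c(odds = odds_ref, probability = prob_ref))
```

A wage of 0 is outside the observed data, so this baseline is mainly a building block for other predictions rather than a quantity of interest in itself.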

So for your question 2 the reference is occ.group1 and area0, as it is for all comparisons given the way you have labeled the levels of the variables. occ.group1 and area0 as in your question 3 is the reference group with odds calculated from the intercept for a 0 wage, but you need to specify an hourly wage to get the odds for a non-zero wage.
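To make the dependence on wage concrete, here is a small sketch using the coefficients reported in the question. The helper `log_odds` is hypothetical and covers only occ.group1 and occ.group2, and the wage of 50 is an assumed example value.

```r
# Coefficients copied from the first model summary in the question
coefs <- c("(Intercept)" = 0.184978, occ.group2 = -0.243524,
           area1 = 0.008863, hr.wage = 0.010475)

# Hypothetical helper: log-odds for an employer in occ.group1 (group2 = 0)
# or occ.group2 (group2 = 1), in area0 (rural = 0) or area1 (rural = 1)
log_odds <- function(group2 = 0, rural = 0, wage = 0) {
  unname(coefs["(Intercept)"] + coefs["occ.group2"] * group2 +
         coefs["area1"] * rural + coefs["hr.wage"] * wage)
}

wage <- 50  # assumed example wage

# Odds of "difficult" for each combination at that wage
odds_g1_area0 <- exp(log_odds(group2 = 0, rural = 0, wage = wage))
odds_g2_area0 <- exp(log_odds(group2 = 1, rural = 0, wage = wage))
odds_g2_area1 <- exp(log_odds(group2 = 1, rural = 1, wage = wage))

print(c(g1_area0 = odds_g1_area0,
        g2_area0 = odds_g2_area0,
        g2_area1 = odds_g2_area1))
```

Each coefficient is added only when its level applies, and the intercept is always included once.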

Your interpretation in question 4 seems to be a bit off. If the areas and wages are the same for the two groups then the only difference to consider in the specified comparison is the coefficient for occ.group2 versus the occ.group1 reference.
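The cancellation can be checked numerically. In this sketch the wage of 60 is an assumed value; any wage and either area give the same ratio, because those terms appear in both linear predictors.

```r
# Coefficients copied from the first model summary in the question
b <- c(intercept = 0.184978, occ.group2 = -0.243524,
       area1 = 0.008863, hr.wage = 0.010475)

wage <- 60  # assumed common wage for both employers

# Linear predictors for occ.group1 and occ.group2, both in area1
eta_g1 <- unname(b["intercept"] + b["area1"] + b["hr.wage"] * wage)
eta_g2 <- eta_g1 + unname(b["occ.group2"])

# Intercept, area and wage cancel in the ratio of the two odds
odds_ratio <- exp(eta_g2) / exp(eta_g1)

print(odds_ratio)                 # approximately 0.78
print(exp(unname(b["occ.group2"])))  # the same value
```

So exp(-0.243524) is an odds ratio relative to occ.group1 at the same area and wage, not the odds themselves; converting it to a probability with odds / (1 + odds) is not meaningful.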
