I'm trying to analyze some data obtained from a survey asking employers about their difficulty in hiring certain occupations (80 total occupations). But I'm uncertain about the interpretation of my results.
Dependent Variable: 0 = not difficult, 1 = difficult.
Independent Variables:
- occupation group (`occ.group`, 1 to 9 different groups): the occupations were grouped to cut down on the total number of options
- area (0 = urban, 1 = rural)
- hourly wage of the occupation (`hr.wage`, continuous, numeric).
The data is structured as such:
   empnum response soc occ.group area hr.wage
1  123450        1   1         1    1   70.20
2  543210        0   1         1    0   50.10
3  111111        0   1         1    1   71.10
4  222222        1   2         2    1   60.23
5  333333        1   2         4    0  100.57
6 4444444        0  80         9    1   60.18
Model is as follows:
logit1 <- glm(response~occ.group+area+hr.wage, family=binomial, data=health.sub)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.184978 0.466055 0.397 0.69144
occ.group2 -0.243524 0.528233 -0.461 0.64479
occ.group3 0.281285 0.407879 0.690 0.49043
occ.group4 -0.063578 0.510229 -0.125 0.90084
occ.group5 -0.039797 0.403032 -0.099 0.92134
occ.group6 0.419655 0.475109 0.883 0.37708
occ.group7 -0.898869 0.652530 -1.378 0.16835
occ.group8 -15.015216 394.6834 -0.038 0.96965
occ.group9 -0.370350 0.405532 -0.913 0.36111
area1 0.008863 0.129376 0.069 0.94538
hr.wage 0.010475 0.004028 2.600 0.00931 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1679.4 on 1259 degrees of freedom
Residual deviance: 1638.5 on 1249 degrees of freedom
AIC: 1660.5
Number of Fisher Scoring iterations: 13
I'm trying to get an idea if my interpretations are correct.
- Because the `occ.group2` coefficient is negative, it would have lower odds of being difficult to hire than `occ.group1` (reference group): odds are exp(-0.2353) = .79, probability = .4414.
- An employer with occupation in `occ.group2` and `area1`: -0.243524 + 0.008863 = -0.234661, and exp(-0.234661) = 0.79 odds of difficult. Would that be compared to `occ.group1` (reference group) in the same `area1`, or would it be `area0`? I guess I'm having a hard time interpreting the reference group with two categorical variables.
- Employer with occupation in `occ.group2` and `area0`? Would I add the coefficient for `occ.group2` (-0.243524) and the intercept (0.184978)? That doesn't seem right, but I'm not sure how to deal with multiple reference groups. What about an employer with `occ.group1` and `area0`?
- Would the odds of an employer with occupation in `group2` and `area1` with a 1 dollar increase in `hr.wage` be -0.243524 + 0.008863 + 0.010475 = -0.224186, exp(-0.224186) = odds of .799, compared to an employer with occupation `group1` in `area1` with a 1 dollar `hr.wage` increase?
Thanks for the explanations and clarification @EdM.
I also had a question about doing deviation/effects coding of the `occ.group` categorical predictor, because I'm not convinced that comparisons between the `occ.group` levels are meaningful due to their major differences. I've read that effects coding makes it possible to compare each group to the overall mean response. But again, I'm not sure I'm fully understanding how to interpret the results.
I've used the `contr.sum` and `contrasts` commands in R to apply effects coding to the `occ.group` variable and end up with this output.
contr.sum(5)
contrasts(health.sub$occ.group) = contr.sum(5)
Call:
glm(formula = response ~ occ.group + area + hr.wage, family = binomial,
data = health.sub)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7637 -1.2172 0.8009 1.0462 1.1513
Coefficients:
#UPDATED deviation/effects coded model based on @EdM's comment.
#`occ.group8` dropped due to limited responses
#Other `occ.group`'s were reorganized to more appropriate categories 1-5
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.546408 0.147042 3.716 0.000202 ***
occ.group1 -0.265066 0.139886 -1.895 0.058109 .
occ.group2 -0.507674 0.110791 -4.582 0.0000046 ***
occ.group3 0.186679 0.154514 1.208 0.226981
occ.group4 0.347765 0.128618 2.704 0.006854 **
area1 0.003353 0.129615 0.026 0.979365
hr.wage 0.002139 0.003228 0.663 0.507424
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1679.4 on 1259 degrees of freedom
Residual deviance: 1635.1 on 1253 degrees of freedom
AIC: 1649.1
Number of Fisher Scoring iterations: 4
#UPDATED values to reflect @EdM's response.
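To make the effects coding more concrete, here is a minimal sketch of what `contr.sum(5)` encodes and how the deviation for the fifth group, which has no row in the summary, can be recovered from the other four. The estimates are copied by hand from the output above; the names `occ_est` and `occ_dev` are made up for illustration, and "level 5" simply means whichever level comes last in the factor's ordering.

```r
# Sum-to-zero (deviation) coding matrix for a 5-level factor.
# Rows are the factor levels, columns are the 4 estimated contrasts;
# the last level is coded -1 in every column.
cs <- contr.sum(5)
print(cs)

# Estimates copied from the effects-coded summary above.
occ_est <- c(occ.group1 = -0.265066, occ.group2 = -0.507674,
             occ.group3 =  0.186679, occ.group4 =  0.347765)

# Deviation of each level's log-odds from the unweighted mean of the
# five group log-odds (holding area and hr.wage fixed).
occ_dev <- drop(cs %*% occ_est)
names(occ_dev) <- paste0("level", 1:5)
print(occ_dev)

# The fifth deviation is minus the sum of the other four,
# so the five deviations sum to zero.
print(sum(occ_dev))
```

Under this coding each reported `occ.group` coefficient is a difference from the average of the group log-odds, not from a reference group.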
How would I go about interpreting the odds for an employer with `occ.group2` and `area0` with an `hr.wage` = 50.00 given the effects coding?
I believe the odds would be calculated as 0.5464 + (-0.5077) + (0.5464 - 0.0034) + (0.0021*50.00) = 0.6937, exp(0.6937) = odds of 2.001, which is in comparison to the mean of the `occ.group` logits rather than any specific reference group.
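One way to check a calculation like this is to write out the linear predictor term by term. The sketch below assumes that only `occ.group` received `contr.sum` contrasts, as in the `contrasts()` call above, so that `area` still uses R's default treatment coding and `area0` contributes nothing. Under that assumption the intercept enters exactly once. The variable names are illustrative and the estimates are copied by hand from the summary.

```r
# Estimates copied from the effects-coded summary above.
b <- c("(Intercept)" = 0.546408, occ.group2 = -0.507674,
       area1 = 0.003353, hr.wage = 0.002139)

rural <- 0    # area0 (urban): the area1 term drops out
wage  <- 50   # hourly wage in dollars

# Linear predictor (log-odds) for occ.group2, area0, hr.wage = 50.
eta <- b[["(Intercept)"]] + b[["occ.group2"]] +
  b[["area1"]] * rural + b[["hr.wage"]] * wage

odds <- exp(eta)     # odds of "difficult" for this combination
prob <- plogis(eta)  # corresponding probability, odds / (1 + odds)

print(c(log_odds = eta, odds = odds, probability = prob))
```

Here `exp(eta)` is the odds for that specific combination of predictors, whereas `exp(b[["occ.group2"]])` alone would be an odds ratio relative to the average over the occupation groups.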
Best Answer
Your understanding seems generally correct. The intercept in this and in other standard R regression summaries represents the case for the reference levels of all categorical variables (false for logical) and for a 0 value of all continuous variables.
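As a rough illustration of what the intercept means in the treatment-coded model, the sketch below turns it into odds and a probability, first at a wage of 0 and then at a hypothetical wage of 50 dollars per hour. The estimates are copied by hand from the first summary in the question; the wage of 50 and the variable names are assumptions chosen only for the example.

```r
# Estimates copied from the treatment-coded summary in the question.
b0     <- 0.184978   # (Intercept): occ.group1, area0, hr.wage = 0
b_wage <- 0.010475   # hr.wage: change in log-odds per extra dollar

# Odds and probability at the reference levels with a wage of 0.
odds_ref_wage0 <- exp(b0)
prob_ref_wage0 <- plogis(b0)   # same as odds / (1 + odds)

# Same reference levels at a hypothetical wage of 50 dollars per hour.
wage <- 50
eta_ref_wage50  <- b0 + b_wage * wage
odds_ref_wage50 <- exp(eta_ref_wage50)
prob_ref_wage50 <- plogis(eta_ref_wage50)

print(c(odds_wage0 = odds_ref_wage0, odds_wage50 = odds_ref_wage50))
print(c(prob_wage0 = prob_ref_wage0, prob_wage50 = prob_ref_wage50))
```

Only the intercept, alone or combined with other terms, gives odds directly; exponentiating any other single coefficient gives an odds ratio, which cannot be converted to a probability on its own.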
So for your question 2 the reference is `occ.group1` and `area0`, as it is for all comparisons given the way you have labeled the levels of the variables. `occ.group1` and `area0` as in your question 3 is the reference group with odds calculated from the intercept for a 0 wage, but you need to specify an hourly wage to get the odds for a non-zero wage.
Your interpretation in question 4 seems to be a bit off. If the areas and wages are the same for the two groups then the only difference to consider in the specified comparison is the coefficient for `occ.group2` versus the `occ.group1` reference.