Logistic regression with multi-level categorical predictors

Tags: logistic, odds-ratio, regression, regression-coefficients, self-study

I am working through the examples in Andy Field’s Discovering Statistics Using R. I am stuck on the last task (Task 3) of the Smart Alex exercises for Chapter 8, logistic regression. The tasks are here: https://studysites.uk.sagepub.com/dsur/study/DSUR%20Smart%20Alex-Labcoat%20Leni-Self%20Test%20Answers/DSUR%20Chapter%2008%20Web%20Material.pdf and the data are here: https://studysites.uk.sagepub.com/dsur/study/articles.htm. The example starts on page 25 of the PDF.

The example predicts the probability of condom use from several predictor variables (the coefficients and odds ratios are pasted below). What I’m stuck on is the interpretation of the coefficients and odds ratios for a categorical predictor with three levels. The predictor in question is previous condom use, labelled "previous" in the example, with the categories "No Condom", "Condom used" and "First Time with partner".

In the model summary there are coefficients for 1) "previousCondom used" and 2) "previousFirst Time with partner". I would interpret these parameters as follows: 1) "previousCondom used" is the difference in log-odds (and the corresponding odds ratio) between "Condom used" and the reference category "No Condom"; 2) "previousFirst Time with partner" is the difference in log-odds (and the corresponding odds ratio) between "First Time with partner" and the reference category "No Condom".

However, the answers to Task 3 (in the PDF) explain the "previousCondom used" coefficient, for example, as comparing the group "Condom used" with the other two groups. I don’t think this is correct: I would have thought that this parameter compares the group "Condom used" with the reference group "No Condom" only.

Am I correct in assuming that when a categorical predictor in logistic regression has more than two levels, one level is chosen as the baseline and each remaining level is then compared pairwise with that baseline? In this example, am I correct that the parameters "previousCondom used" and "previousFirst Time with partner" compare "Condom used" with "No Condom" and "First Time with partner" with "No Condom", respectively?

    Coefficients:
                                     Estimate Std. Error z value Pr(>|z|)
    (Intercept)                     -4.959739   1.146497  -4.326 1.52e-05 ***
    genderFemale                     0.002656   0.572823   0.005  0.99630    
    safety                          -0.482460   0.236033  -2.044  0.04095 *  
    perceive                         0.949088   0.236972   4.005 6.20e-05 ***
    selfcon                          0.347626   0.126842   2.741  0.00613 ** 
    previousCondom used              1.087196   0.551952   1.970  0.04887 *  
    previousFirst Time with partner -0.016615   1.399907  -0.012  0.99053    
    sexexp                           0.180423   0.111586   1.617  0.10590    

The odds ratios are:

                                    exp.mod2.coefficients.
    (Intercept)                                0.007014758
    genderFemale                               1.002659308
    safety                                     0.617263292
    perceive                                   2.583353254
    selfcon                                    1.415702224
    previousCondom used                        2.965946499
    previousFirst Time with partner            0.983522066
    sexexp                                     1.197724363
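
As a sanity check on the table above, the odds ratios should be nothing more than the exponentiated estimates. A minimal sketch, with the estimates copied by hand from the output (the vector names `est` and `odds_ratio` are hypothetical and not taken from the book's code):

```r
# Estimates copied from the coefficient table above (log-odds scale).
est <- c(
  "(Intercept)"                     = -4.959739,
  "genderFemale"                    =  0.002656,
  "safety"                          = -0.482460,
  "perceive"                        =  0.949088,
  "selfcon"                         =  0.347626,
  "previousCondom used"             =  1.087196,
  "previousFirst Time with partner" = -0.016615,
  "sexexp"                          =  0.180423
)

# An odds ratio is exp(coefficient): the multiplicative change in the odds
# for a one-unit increase in a numeric predictor or, for a dummy variable,
# for belonging to that level rather than the level it is contrasted with.
odds_ratio <- exp(est)
print(round(odds_ratio, 4))
```

How the two `previous` odds ratios should be read depends on the contrast coding of the factor, which is exactly what the question is about.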

Best Answer

You do seem to have found an error in the source you cite, but you have to be careful in general. Note this sentence from that source:

> Previous use has been split into two components (according to whatever contrasts were specified for this variable).

Exactly what a model's output displays for a multi-level categorical variable depends on how the contrasts for that variable were coded, which isn't shown in the excerpts you quote in your question. In this case it seems that R's default treatment contrasts were used. If so, you are correct: the coefficient for each level (and its statistical test) is relative to the reference level of that variable.
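
To see what treatment coding does, here is a minimal sketch on a toy factor. The level labels are assumptions reconstructed from the question and the output, not read from the actual data file:

```r
# Toy factor with one observation per level; the first level is the reference.
lv <- c("No Condom", "Condom used", "First Time with partner")
previous <- factor(lv, levels = lv)

# Set treatment coding explicitly so the result does not depend on
# the current value of options("contrasts").
contrasts(previous) <- contr.treatment(levels(previous))
print(contrasts(previous))

# The design matrix has one dummy column per non-reference level. Its column
# names are the variable name pasted onto the level name, which is where
# labels such as "previousCondom used" in the summary come from.
X <- model.matrix(~ previous)
print(X)

# A different baseline can be chosen with relevel().
previous_alt <- relevel(previous, ref = "Condom used")
print(levels(previous_alt))
```

The reference level is the row of zeros in the contrast matrix, so each dummy coefficient is the change in log-odds from that level alone, not from the average of the other levels.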

Several other types of contrasts are possible, however, as explained for example on this page. One might be able to devise a contrast that compares each of two levels with the average of the other levels, as the source claims for this example. It just doesn't appear that such a contrast was actually used in this particular case.
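
As an illustration only, the sketch below lists two built-in alternatives and then constructs a custom coding in which each coefficient compares one level with the average of the other two. The level labels, the row names of `H`, and the level means in `mu` are all made up for the example; nothing here is taken from the book's code.

```r
lv <- c("No Condom", "Condom used", "First Time with partner")

# Two built-in alternatives to treatment coding.
print(contr.sum(3))      # deviations of levels from the unweighted grand mean
print(contr.helmert(3))  # each level against the mean of the preceding levels

# Hypothesis matrix: each row says what a coefficient should estimate,
# written as weights on the three level means (log-odds in a logit model).
H <- rbind(
  intercept     = c( 1/3,  1/3,  1/3),
  used_vs_rest  = c(-1/2,    1, -1/2),
  first_vs_rest = c(-1/2, -1/2,    1)
)
colnames(H) <- lv

# The coding matrix is the inverse of H with the intercept column dropped.
custom <- solve(H)[, -1]

# Attach it to a factor so that glm() or lm() would use it.
previous <- factor(lv, levels = lv)
contrasts(previous) <- custom

# Check with made-up level means: the coefficients implied by this coding
# should reproduce the comparisons written in H.
mu <- c(1, 2, 6)
beta <- solve(cbind(1, custom), mu)
print(beta)
print(drop(H %*% mu))
```

With a coding like this, the reading given in the book's answer would apply. Under the default treatment coding it does not.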