Solved – Interpreting logistic regression results when explanatory variable has multiple levels

logistic, logit

When an explanatory variable in logistic regression is binary, the interpretation is relatively straightforward.

For example, when your response variable is college admission (binary: yes or no) and explanatory variable is gender (with two levels: female and male), you can say that being male increases (or decreases) the odds of being admitted by a factor of n, holding other variables constant.

But what if the explanatory variable has multiple levels / categories? For instance, your response variable is college admission, but the explanatory variable is ethnicity (with levels: Caucasian, African_American, Hispanic, and Asian). R (or any other software) will still run the model, automatically choosing one of the levels as the reference (in R it will be African_American unless specified otherwise). The results will tell us how not being African_American (or vice versa) affects the odds of admission, but say nothing about being Caucasian, Hispanic, or Asian.

What's the best way of dealing with this problem? Should one create dummy variables for each ethnic group, or is there a better way?

Best Answer

The interpretation for categorical variables with more than 2 levels is very similar to the binary case you mention; for a $k$-level categorical variable, you will have $k-1$ regression coefficients, each of which compares the odds of the outcome in one group to the odds in the reference group. For the example you state, ethnicity (Caucasian, African-American, Hispanic, and Asian), let us assume your referent (baseline) group is African-American. Many software packages for logistic regression will give you 3 odds ratios (for a 4-level categorical predictor) once you run the regression. Let us quickly look at how this is done in R using a simulated dataset:

    ########### Simulate data ###########
    set.seed(123) # set a seed so the simulation results are reproducible
    # Ethnicity codes: African-American (AF), Asian (AS), Hispanic (HI),
    # and Caucasian (CA)
    x1 <- sample(c("AF", "AS", "HI", "CA"), 10000, replace = TRUE)
    x1 <- factor(x1, levels = c("AF", "AS", "HI", "CA"))
    # fix the level order so that AF is the reference

    x1.fac <- model.matrix(~ x1) # generate dummy variables for simulation
                                 # purposes (in practice you may not need
                                 # to do this)
    betas <- c(.2, .5, .53) # log odds ratios comparing AS, HI and CA to the
                            # referent level AF (made-up values for
                            # illustration and simulation purposes!)
    xbeta <- x1.fac[, -1] %*% betas # need only k-1 dummies for a
                                    # variable with k levels
    y <- rbinom(n = 10000, size = 1,
                prob = exp(xbeta) / (1 + exp(xbeta))) # simulate outcome (Y)

    # Finally, we have the following sample data:

    example_data <- data.frame(y,x1)
    
    ####Run regression of outcome against ethnicity
    model1 <- glm(y~x1,family = binomial,data = example_data)
    exp(coef(model1))[-1] ###Odds Ratios comparing each group 
                          #with the reference group of AF
    x1AS     x1HI     x1CA 
    1.229610 1.800985 1.796416 
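
For reference, the model fitted above can be written out explicitly. This is only a sketch in my own notation (the indicator symbols and coefficient subscripts are labels I am assuming, matching the `x1AS`, `x1HI`, `x1CA` terms in the output):

$$
\log\frac{P(Y=1)}{1-P(Y=1)} = \beta_0 + \beta_{AS}\,\mathbb{1}[\text{AS}] + \beta_{HI}\,\mathbb{1}[\text{HI}] + \beta_{CA}\,\mathbb{1}[\text{CA}]
$$

For the reference group AF all indicators are zero, so $\beta_0$ is its log odds. Each exponentiated coefficient is then an odds ratio against AF, and a comparison between two non-reference groups is the exponentiated difference of their coefficients:

$$
e^{\beta_{AS}} = \frac{\text{odds}(Y=1 \mid \text{AS})}{\text{odds}(Y=1 \mid \text{AF})},
\qquad
e^{\beta_{CA}-\beta_{HI}} = \frac{\text{odds}(Y=1 \mid \text{CA})}{\text{odds}(Y=1 \mid \text{HI})}
$$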

So what does the odds ratio of 1.23 for Asians mean? It means that, compared to African-Americans, Asians had 23% higher odds of the outcome. Equivalently, you can say that Asians have 1.23 times the odds of the outcome compared to the referent group of African-Americans. The odds ratios of 1.801 and 1.796 for Hispanics and Caucasians, respectively, are interpreted in the same manner. The most important part of modeling categorical variables is identifying the proper referent group. You can always change the reference group by using the relevel() command in R. See example here.
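
As a minimal sketch of changing the reference group, the block below re-simulates data along the lines of the example and refits the model with Hispanic (HI) as the baseline. The names `model_af`, `model_hi` and the extra column `x1_hi` are my own illustrative choices, not part of the original example.

```r
# Re-simulate data similar to the example (made-up log odds ratios vs. AF)
set.seed(123)
n <- 10000
x1 <- factor(sample(c("AF", "AS", "HI", "CA"), n, replace = TRUE),
             levels = c("AF", "AS", "HI", "CA"))
betas <- c(AF = 0, AS = 0.2, HI = 0.5, CA = 0.53)
y <- rbinom(n, size = 1, prob = plogis(betas[as.character(x1)]))
example_data <- data.frame(y, x1)

# Original parameterisation: AF is the reference group
model_af <- glm(y ~ x1, family = binomial, data = example_data)

# Same model, but with HI moved to the front of the levels (new reference)
example_data$x1_hi <- relevel(example_data$x1, ref = "HI")
model_hi <- glm(y ~ x1_hi, family = binomial, data = example_data)

# Odds ratios of AF, AS and CA relative to HI
exp(coef(model_hi))[-1]

# The CA-vs-HI log odds ratio from the releveled fit should agree with the
# difference of the AF-based coefficients (up to numerical tolerance)
ca_vs_hi_relevel <- unname(coef(model_hi)["x1_hiCA"])
ca_vs_hi_diff <- unname(coef(model_af)["x1CA"] - coef(model_af)["x1HI"])
all.equal(ca_vs_hi_relevel, ca_vs_hi_diff, tolerance = 1e-4)
```

Releveling does not change the fitted probabilities; it only changes which comparisons the coefficients express directly.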

In order to make comparison between two groups where one of them is not a referent group, there are a few ways to go:

  1. Use relevel() function and re-run the regression changing the reference group to your variable of interest (not my favorite approach when there are many levels in your categorical predictor)

  2. Use an existing package to make this comparison.

I am not sure how this is done in Stata or SAS (probably a CONTRAST statement in SAS), but you can easily do this in R using the car package. For example, if you want to test whether the odds of the outcome differ between Caucasians and Hispanics, use the following commands:

    library(car)
    linearHypothesis(model1, c("x1CA - x1HI = 0"))

    Linear hypothesis test
    
    Hypothesis:
    - x1HI  + x1CA = 0
    
    Model 1: restricted model
    Model 2: y ~ x1
    
      Res.Df Df  Chisq Pr(>Chisq)
    1   9997                     
    2   9996  1 0.0018     0.9658

In this case, we fail to reject the null hypothesis of no difference in the odds of the outcome between Caucasians and Hispanics (p-value=0.9658).
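
If you prefer not to depend on an extra package, the same kind of Wald test can be sketched by hand from `coef()` and `vcov()`. The block below is a self-contained illustration under that assumption: it re-simulates data along the lines of the example, and names such as `contrast`, `est` and `or_ci` are my own. It also gives the CA-vs-HI odds ratio with a Wald 95% confidence interval, which the test output alone does not show.

```r
# Re-simulate data similar to the example (made-up log odds ratios vs. AF)
set.seed(123)
n <- 10000
x1 <- factor(sample(c("AF", "AS", "HI", "CA"), n, replace = TRUE),
             levels = c("AF", "AS", "HI", "CA"))
betas <- c(AF = 0, AS = 0.2, HI = 0.5, CA = 0.53)
y <- rbinom(n, size = 1, prob = plogis(betas[as.character(x1)]))
example_data <- data.frame(y, x1)

model1 <- glm(y ~ x1, family = binomial, data = example_data)

b <- coef(model1)  # estimated coefficients (log odds scale)
V <- vcov(model1)  # their estimated covariance matrix

# Contrast vector selecting beta_CA - beta_HI, aligned to the coefficient names
contrast <- c("(Intercept)" = 0, x1AS = 0, x1HI = -1, x1CA = 1)[names(b)]

est <- sum(contrast * b)                          # log odds ratio, CA vs. HI
se <- sqrt(drop(t(contrast) %*% V %*% contrast))  # its standard error
chisq <- (est / se)^2                             # Wald chi-square, 1 df
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

# Odds ratio for CA vs. HI with a Wald 95% confidence interval
or_ci <- exp(est + c(estimate = 0, lower = -1, upper = 1) * qnorm(0.975) * se)

chisq
p_value
or_ci
```

Using the covariance between the two coefficients matters here: both are estimated against the same reference group, so they are not independent.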
