R – Resolving Slightly Off Coefficients in Weighted Effect Coding for Binomial Logistic Regression

generalized linear modellogisticrregressionregression coefficients

I noticed some discrepancies between the expected output and actual output for binomial glm in R using customized weighted effect contrast coding.

The purpose of weighted effect coding is to use the overall sample mean as the reference group. All data is taken from the source package wec as shown below.

library(wec)
library(broom)
library(tidyverse)

data(PUMS)

#Convert wage to a binary variable
PUMS <- as_tibble(PUMS) %>%
  mutate(wage = if_else(wage>50000, 1, 0))

#Switch to weighted effect contrasts
contrasts(PUMS$race) <- contr.wec(PUMS$race, "Black")

#Overall binomial percent
overall <- PUMS %>%
  summarise(Percent = sum(wage)/n())
 
overall_odds <- overall[[1,1]]/(1-overall[[1,1]])

#Calculated perfectly when treating the 0-1 binomial variable as linear
#The (intercept matches overall[[1, 1]])
tidy(lm(wage ~ race, data = PUMS))

#There are some odd slight discrepancies between the Odds_Ratio columns
tidy(glm(wage ~ race, family = binomial, data = PUMS)) %>%
  mutate(Odds_Ratio = exp(estimate))

PUMS %>%
  group_by(race) %>%
  summarise(Percent = sum(wage)/n()) %>%
  mutate(Odds_Ratio = (Percent/(1 - Percent))/overall_odds)

As you can see if you run the code, the output from the last two chunks is slightly off for some reason. For example, the odds ratio for the Asian group should be 1.59 but it appears as 1.61 in the glm model.

However, if the binomial outcome variable is treated as continuous a simple lm call predicts the odds ratios perfectly (the intercept matches overall[[1, 1]]).

Is there some mechanistic reason or rounding error that causes binomial regression to behave this way, and can I adjust settings to fix it? I would really appreciate help in this as it is critical for my research.

EDIT:

If anybody ever encounters this issue again, here is a modification of Ben Bolker's answer to get the weighted overall odds and expected coefficients for multiple groups, not just the unweighted odds in regular two-variable effect coding.

library(broom)
library(wec)
library(tidyverse)

set.seed(11)
dd <- data.frame(x = factor(rep(c("a", "b", "c"), times = c(153, 46, 51))), 
                 y = rbinom(250, 1, 0.5))

contrasts(dd$x) <- wec::contr.wec(dd$x, omitted = "b")

dds <- (dd |>
          dplyr::group_by(x) |>
          dplyr::summarise(pct = mean(y), odds = pct/(1-pct), count = n()))

#weighted
overall_odds <- exp(weighted.mean(log(dds$odds), dds$count))

dds |> dplyr::mutate(estimate = odds/overall_odds) |>
  relocate(estimate, .before = pct)

tidy(glm(y ~ x, family = binomial, dd), exponentiate = TRUE)

Best Answer

I made up a simpler, more extreme example that strongly hints that the problem is nonlinear averaging, not some rounding issue.

library(broom)
dd <- data.frame(x = factor(rep(c("a", "b"), each = 100)), 
                 y = rep(c(1,0,1,0), times = c(95, 5, 10, 90)))
contrasts(dd$x) <- wec::contr.wec(dd$x, omitted = "b")
overall_odds <- mean(dd$y)/(1-mean(dd$y))
dd |> dplyr::group_by(x) |> dplyr::summarise(pct = mean(y), 
                                             oddsratio = pct/(1-pct)/overall_odds)
##   x       pct   oddsratio
## 1 a      0.95 17.2  
## 2 b      0.1   0.101
tidy(glm(y ~ x, family = binomial, dd), exponentiate = TRUE)
## (exponentiate = TRUE returns estimates as odds ratios 
##    rather than log-odds diffs)
##   term        estimate std.error statistic  p.value
## 1 (Intercept)     1.45     0.284      1.32 1.88e- 1
## 2 xa             13.1      0.284      9.07 1.23e-19

A little bit more digging in te Grotenhuis et al 2017 explains what's going on:

Likewise, if a researcher wishes to investigate obesity (BMI > 30) and, therefore, uses a dichotomy of BMI in a logistic regression analysis, then the mean to be tested against is the average of all log odds.

(emphasis added).

So the value that you compare against should not be the overall odds for the data set ($p/(1-p)$), but the geometric mean of the odds for each group (i.e. $\exp(\textrm{mean}(\log(p_i/(1-p_i)))$).

Continuing the example above:

dds <- (dd |> dplyr::group_by(x)
    |> dplyr::summarise(pct = mean(y),
                        odds = pct/(1-pct)))
overall_odds <- exp(mean(log(dds$odds)))
dds <- (dds |> dplyr::mutate(odds_adj = odds/overall_odds))

Now the odds_adj column matches the values calculated by glm():

  x       pct   odds odds_adj
1 a      0.95 19.0    13.1   
2 b      0.1   0.111   0.0765

te Grotenhuis, Manfred, Ben Pelzer, Rob Eisinga, Rense Nieuwenhuis, Alexander Schmidt-Catran, and Ruben Konig. “When Size Matters: Advantages of Weighted Effect Coding in Observational Studies.” International Journal of Public Health 62, no. 1 (January 1, 2017): 163–67. https://doi.org/10.1007/s00038-016-0901-1.

Related Question