I noticed some discrepancies between the expected output and actual output for binomial glm in R using customized weighted effect contrast coding.
The purpose of weighted effect coding is to use the overall sample mean as the reference group. All data is taken from the source package wec as shown below.
library(wec)
library(broom)
library(tidyverse)
data(PUMS)
#Convert wage to a binary variable
PUMS <- as_tibble(PUMS) %>%
mutate(wage = if_else(wage>50000, 1, 0))
#Switch to weighted effect contrasts
contrasts(PUMS$race) <- contr.wec(PUMS$race, "Black")
#Overall binomial percent
overall <- PUMS %>%
summarise(Percent = sum(wage)/n())
overall_odds <- overall[[1,1]]/(1-overall[[1,1]])
#Calculated perfectly when treating the 0-1 binomial variable as linear
#The (intercept matches overall[[1, 1]])
tidy(lm(wage ~ race, data = PUMS))
#There are some odd slight discrepancies between the Odds_Ratio columns
tidy(glm(wage ~ race, family = binomial, data = PUMS)) %>%
mutate(Odds_Ratio = exp(estimate))
PUMS %>%
group_by(race) %>%
summarise(Percent = sum(wage)/n()) %>%
mutate(Odds_Ratio = (Percent/(1 - Percent))/overall_odds)
As you can see if you run the code, the output from the last two chunks is slightly off for some reason. For example, the odds ratio for the Asian group should be 1.59 but it appears as 1.61 in the glm model.
However, if the binomial outcome variable is treated as continuous a simple lm call predicts the odds ratios perfectly (the intercept matches overall[[1, 1]]).
Is there some mechanistic reason or rounding error that causes binomial regression to behave this way, and can I adjust settings to fix it? I would really appreciate help in this as it is critical for my research.
EDIT:
If anybody ever encounters this issue again, here is a modification of Ben Bolker's answer to get the weighted overall odds and expected coefficients for multiple groups, not just the unweighted odds in regular two-variable effect coding.
library(broom)
library(wec)
library(tidyverse)
set.seed(11)
dd <- data.frame(x = factor(rep(c("a", "b", "c"), times = c(153, 46, 51))),
y = rbinom(250, 1, 0.5))
contrasts(dd$x) <- wec::contr.wec(dd$x, omitted = "b")
dds <- (dd |>
dplyr::group_by(x) |>
dplyr::summarise(pct = mean(y), odds = pct/(1-pct), count = n()))
#weighted
overall_odds <- exp(weighted.mean(log(dds$odds), dds$count))
dds |> dplyr::mutate(estimate = odds/overall_odds) |>
relocate(estimate, .before = pct)
tidy(glm(y ~ x, family = binomial, dd), exponentiate = TRUE)
Best Answer
I made up a simpler, more extreme example that strongly hints that the problem is nonlinear averaging, not some rounding issue.
A little bit more digging in te Grotenhuis et al 2017 explains what's going on:
(emphasis added).
So the value that you compare against should not be the overall odds for the data set ($p/(1-p)$), but the geometric mean of the odds for each group (i.e. $\exp(\textrm{mean}(\log(p_i/(1-p_i)))$).
Continuing the example above:
Now the
odds_adj
column matches the values calculated byglm()
:te Grotenhuis, Manfred, Ben Pelzer, Rob Eisinga, Rense Nieuwenhuis, Alexander Schmidt-Catran, and Ruben Konig. “When Size Matters: Advantages of Weighted Effect Coding in Observational Studies.” International Journal of Public Health 62, no. 1 (January 1, 2017): 163–67. https://doi.org/10.1007/s00038-016-0901-1.