R – Power Analysis for Multiple Logistic Regression via Simulation vs Power.Prop.Test

logisticmultiple regressionrstatistical-power

I am using simulation for a power analysis of an experiment tested with a multiple logistic regression model. I find substantial improvements in power for my model with covariate vs without covariate.

# packages
if (!require("pacman")) install.packages("pacman") 
pacman::p_load(broom, tidyverse)

Simulating Data with Binary outcome

set.seed(42)

N <- 1000
B_INT <- -.5  
B_COV_MC <- .1  
B_COND <- 0.43

df <- 
  tibble(
    x_int = 1,
    x_cond = rbinom(n = N, size = 1, pr = .5),
    x_cov = rnorm(n = N, mean = 40, sd = 15), # extract from your baseline data
    x_cov_mc = x_cov - mean(x_cov), # mean centering covariate
    y_lor = B_INT*x_int + B_COND*x_cond + B_COV_MC*x_cov_mc,
    y_pr = 1/(1+exp(-y_lor)),
    y = rbinom(N, 1, y_pr) 
  ) 

cor(df$x_cov_mc, df$y) # strong correlation of covariate with outcome

Fitting 1000 models with covariate

set.seed(42)
N_SAMPLING_DIST <- 1e3
p_vector <- vector(length = N_SAMPLING_DIST, mode = 'numeric')

for (i in 1:N_SAMPLING_DIST) {
df_cond_cov <- 
  tibble(
    x_int = 1,
    x_cond = rbinom(n = N, size = 1, pr = .5),
    x_cov = rnorm(n = N, mean = 40, sd = 15), # extract from your baseline data
    x_cov_mc = x_cov - mean(x_cov),
    y_lor = B_INT*x_int + B_COND*x_cond + B_COV_MC*x_cov_mc,
    y_pr = 1/(1+exp(-y_lor)),
    y = rbinom(N, 1, y_pr) 
  ) 

m <- glm(y ~ x_cond + x_cov, family = 'binomial', data = df_cond_cov)

p <- tidy(m) %>% 
  filter(term == 'x_cond') %>% 
  pull(p.value)

p_vector[i] <- p
}

cond_cov_power <- mean(p_vector < .05)
cond_cov_power

81% of simulated models with N = 1000 observed a significant effect for x_cond when controlling for covariates.

Fitting 1000 models without covariate

set.seed(42)
p_vector <- vector(length = N_SAMPLING_DIST, mode = 'numeric')

for (i in 1:N_SAMPLING_DIST) {
df_cond_cov <- 
  tibble(
    x_int = 1,
    x_cond = rbinom(n = N, size = 1, pr = .5),
    x_cov = rnorm(n = N, mean = 40, sd = 15), # extract from your baseline data
    x_cov_mc = x_cov - mean(x_cov),
    y_lor = B_INT*x_int + B_COND*x_cond + B_COV_MC*x_cov_mc,
    y_pr = 1/(1+exp(-y_lor)),
    y = rbinom(N, 1, y_pr) 
  ) 

m <- glm(y ~ x_cond, family = 'binomial', data = df_cond_cov)

p <- tidy(m) %>% 
  filter(term == 'x_cond') %>% 
  pull(p.value)

p_vector[i] <- p
}

cond_power <- mean(p_vector < .05)
cond_power

67% of simulated models with N = 1000 observed a significant effect for x_cond when NOT controlling for covariates.

This improvement in power is in the expected direction, but I am surprised that I have higher reported power from power.prop.test() in base R (see below).

power.prop.test() results

group_means <- 
    df_cond_cov %>% 
    group_by(x_cond) %>% 
    summarise(mean = mean(y))

group_means

Group means are 41% & 53%.

power.prop.test(p1 = group_means$mean[1], p2 = group_means$mean[2], n = N/2)

How can I have 97% power using just my treatment condition in power.prop.test(), when my logistic regression with covariates only achieved 81% power?

Best Answer

The group means you are using (41%, 53%) depend strongly on the seed.

These appear to be not actually the true marginal average probabilities of your model - note that these were averages of only 1000 samples.

I expect if you instead computed group means over a df_cond_cov that had e.g. 1 million rows, this would fix the issue and probably give you a more reasonable-seeming answer

-