R – Power Analysis for Multiple Logistic Regression via Simulation vs Power.Prop.Test

logisticmultiple regressionrstatistical-power

I am using simulation for a power analysis of an experiment tested with a multiple logistic regression model. I find substantial improvements in power for my model with covariate vs without covariate.

# packages
if (!require("pacman")) install.packages("pacman") 
pacman::p_load(broom, tidyverse)

Simulating Data with Binary outcome

set.seed(42)

N <- 1000
B_INT <- -.5  
B_COV_MC <- .1  
B_COND <- 0.43

df <- 
  tibble(
    x_int = 1,
    x_cond = rbinom(n = N, size = 1, pr = .5),
    x_cov = rnorm(n = N, mean = 40, sd = 15), # extract from your baseline data
    x_cov_mc = x_cov - mean(x_cov), # mean centering covariate
    y_lor = B_INT*x_int + B_COND*x_cond + B_COV_MC*x_cov_mc,
    y_pr = 1/(1+exp(-y_lor)),
    y = rbinom(N, 1, y_pr) 
  ) 

cor(df$x_cov_mc, df$y) # strong correlation of covariate with outcome

Fitting 1000 models with covariate

set.seed(42)
N_SAMPLING_DIST <- 1e3
p_vector <- vector(length = N_SAMPLING_DIST, mode = 'numeric')

for (i in 1:N_SAMPLING_DIST) {
df_cond_cov <- 
  tibble(
    x_int = 1,
    x_cond = rbinom(n = N, size = 1, pr = .5),
    x_cov = rnorm(n = N, mean = 40, sd = 15), # extract from your baseline data
    x_cov_mc = x_cov - mean(x_cov),
    y_lor = B_INT*x_int + B_COND*x_cond + B_COV_MC*x_cov_mc,
    y_pr = 1/(1+exp(-y_lor)),
    y = rbinom(N, 1, y_pr) 
  ) 

m <- glm(y ~ x_cond + x_cov, family = 'binomial', data = df_cond_cov)

p <- tidy(m) %>% 
  filter(term == 'x_cond') %>% 
  pull(p.value)

p_vector[i] <- p
}

cond_cov_power <- mean(p_vector < .05)
cond_cov_power

81% of simulated models with N = 1000 observed a significant effect for x_cond when controlling for covariates.

Fitting 1000 models without covariate

set.seed(42)
p_vector <- vector(length = N_SAMPLING_DIST, mode = 'numeric')

for (i in 1:N_SAMPLING_DIST) {
df_cond_cov <- 
  tibble(
    x_int = 1,
    x_cond = rbinom(n = N, size = 1, pr = .5),
    x_cov = rnorm(n = N, mean = 40, sd = 15), # extract from your baseline data
    x_cov_mc = x_cov - mean(x_cov),
    y_lor = B_INT*x_int + B_COND*x_cond + B_COV_MC*x_cov_mc,
    y_pr = 1/(1+exp(-y_lor)),
    y = rbinom(N, 1, y_pr) 
  ) 

m <- glm(y ~ x_cond, family = 'binomial', data = df_cond_cov)

p <- tidy(m) %>% 
  filter(term == 'x_cond') %>% 
  pull(p.value)

p_vector[i] <- p
}

cond_power <- mean(p_vector < .05)
cond_power

67% of simulated models with N = 1000 observed a significant effect for x_cond when NOT controlling for covariates.

This improvement in power is in the expected direction, but I am surprised that I have higher reported power from power.prop.test() in base R (see below).

power.prop.test() results

group_means <- 
    df_cond_cov %>% 
    group_by(x_cond) %>% 
    summarise(mean = mean(y))

group_means

Group means are 41% & 53%.

power.prop.test(p1 = group_means$mean[1], p2 = group_means$mean[2], n = N/2)

How can I have 97% power using just my treatment condition in power.prop.test(), when my logistic regression with covariates only achieved 81% power?

Best Answer

The group means you are using (41%, 53%) depend strongly on the seed.

These appear to be not actually the true marginal average probabilities of your model - note that these were averages of only 1000 samples.

I expect if you instead computed group means over a df_cond_cov that had e.g. 1 million rows, this would fix the issue and probably give you a more reasonable-seeming answer

Preliminaries:

As discussed in the G*Power manual, there are several different types of power analyses, depending on what you want to solve for. (That is, $N$, the effect size $ES$, $\alpha$, and power exist in relation to each other; specifying any three of them will let you solve for the fourth.)
- in your description, you want to know the appropriate $N$ to capture the response rates you specified with $\alpha=.05$, and power = 80%. This is a-priori power.
- we can start with post-hoc power (determine power given $N$, response rates, & alpha) as this is conceptually simpler, and then move up
In addition to @GregSnow's excellent post, another really great guide to simulation-based power analyses on CV can be found here: Calculating statistical power. To summarize the basic ideas:
1. figure out the effect you want to be able to detect
2. generate N data from that possible world
3. run the analysis you intend to conduct over those faux data
4. store whether the results are 'significant' according to your chosen alpha
5. repeat many ($B$) times & use the % 'significant' as an estimate of (post-hoc) power at that $N$
6. to determine a-priori power, search over possible $N$'s to find the value that yields your desired power
Whether you will find significance on a particular iteration can be understood as the outcome of a Bernoulli trial with probability $p$ (where $p$ is the power). The proportion found over $B$ iterations allows us to approximate the true $p$. To get a better approximation, we can increase $B$, although this will also make the simulation take longer.
In R, the primary way to generate binary data with a given probability of 'success' is ?rbinom
- E.g. to get the number of successes out of 10 Bernoulli trials with probability p, the code would be rbinom(n=10, size=1, prob=p), (you will probably want to assign the result to a variable for storage)
- you can also generate such data less elegantly by using ?runif, e.g., ifelse(runif(1)<=p, 1, 0)
- if you believe the results are mediated by a latent Gaussian variable, you could generate the latent variable as a function of your covariates with ?rnorm, and then convert them into probabilities with pnorm() and use those in your rbinom() code.
You state that you will "include a polynomial term Var1*Var1) to account for any curvature". There is a confusion here; polynomial terms can help us account for curvature, but this is an interaction term--it will not help us in this way. Nonetheless, your response rates require us to include both squared terms and interaction terms in our model. Specifically, your model will need to include: $var1^2$, $var1*var2$, and $var1^2*var2$, beyond the basic terms.
Although written in the context of a different question, my answer here: Difference between logit and probit models has a lot of basic information about these types of models.
Just as there are different kinds of Type I error rates when there are multiple hypotheses (e.g., per-contrast error rate, familywise error rate, & per-family error rate), so are there different kinds of power* (e.g., for a single pre-specified effect, for any effect, & for all effects). You could also seek for the power to detect a specific combination of effects, or for the power of a simultaneous test of the model as a whole. My guess from your description of your SAS code is that it is looking for the latter. However, from your description of your situation, I am assuming you want to detect the interaction effects at a minimum.
- *reference: Maxwell, S.E. (2004). The persistence of underpowered studies in psychological research: causes, consequences, and remedies. Psychological Methods, 9, 2, pp. 147-163.
- your effects are quite small (not to be confused with the low response rates), so we will find it difficult to achieve good power.
- Note that, although these all sound fairly similar, they are very much not the same (e.g., it is very possible to get a significant model with no significant effects--discussed here: How can a regression be significant yet all predictors be non-significant?, or significant effects but where the model is not significant--discussed here: Significance of coefficients in linear regression: significant t-test vs non-significant F-statistic), which will be illustrated below.
For a different way to think about issues related to power, see my answer here: How to report general precision in estimating correlations within a context of justifying sample size.

Simple post-hoc power for logistic regression in R:

Let's say your posited response rates represent the true situation in the world, and that you had sent out 10,000 letters. What is the power to detect those effects? (Note that I am famous for writing "comically inefficient" code, the following is intended to be easy to follow rather than optimized for efficiency; in fact, it's quite slow.)

set.seed(1)

repetitions = 1000
N = 10000
n = N/8
var1  = c(   .03,    .03,    .03,    .03,    .06,    .06,    .09,   .09)
var2  = c(     0,      0,      0,      1,      0,      1,      0,     1)
rates = c(0.0025, 0.0025, 0.0025, 0.00395, 0.003, 0.0042, 0.0035, 0.002)

var1    = rep(var1, times=n)
var2    = rep(var2, times=n)
var12   = var1**2
var1x2  = var1 *var2
var12x2 = var12*var2

significant = matrix(nrow=repetitions, ncol=7)

startT = proc.time()[3]
for(i in 1:repetitions){
  responses          = rbinom(n=N, size=1, prob=rates)
  model              = glm(responses~var1+var2+var12+var1x2+var12x2, 
                           family=binomial(link="logit"))
  significant[i,1:5] = (summary(model)$coefficients[2:6,4]<.05)
  significant[i,6]   = sum(significant[i,1:5])
  modelDev           = model$null.deviance-model$deviance
  significant[i,7]   = (1-pchisq(modelDev, 5))<.05
}
endT = proc.time()[3]
endT-startT

sum(significant[,1])/repetitions      # pre-specified effect power for var1
[1] 0.042
sum(significant[,2])/repetitions      # pre-specified effect power for var2
[1] 0.017
sum(significant[,3])/repetitions      # pre-specified effect power for var12
[1] 0.035
sum(significant[,4])/repetitions      # pre-specified effect power for var1X2
[1] 0.019
sum(significant[,5])/repetitions      # pre-specified effect power for var12X2
[1] 0.022
sum(significant[,7])/repetitions      # power for likelihood ratio test of model
[1] 0.168
sum(significant[,6]==5)/repetitions   # all effects power
[1] 0.001
sum(significant[,6]>0)/repetitions    # any effect power
[1] 0.065
sum(significant[,4]&significant[,5])/repetitions   # power for interaction terms
[1] 0.017

So we see that 10,000 letters doesn't really achieve 80% power (of any sort) to detect these response rates. (I am not sufficiently sure about what the SAS code is doing to be able to explain the stark discrepancy between these approaches, but this code is conceptually straightforward--if slow--and I have spent some time checking it, and I think these results are reasonable.)

Simulation-based a-priori power for logistic regression:

From here the idea is simply to search over possible $N$'s until we find a value that yields the desired level of the type of power you are interested in. Any search strategy that you can code up to work with this would be fine (in theory). Given the $N$'s that are going to be required to capture such small effects, it is worth thinking about how to do this more efficiently. My typical approach is simply brute force, i.e. to assess each $N$ that I might reasonably consider. (Note however, that I would typically only consider a small range, and I'm typically working with very small $N$'s--at least compared to this.)

Instead, my strategy here was to bracket possible $N$'s to get a sense of what the range of powers would be. Thus, I picked an $N$ of 500,000 and re-ran the code (initiating the same seed, n.b. this took an hour and a half to run). Here are the results:

sum(significant[,1])/repetitions      # pre-specified effect power for var1
[1] 0.115
sum(significant[,2])/repetitions      # pre-specified effect power for var2
[1] 0.091
sum(significant[,3])/repetitions      # pre-specified effect power for var12
[1] 0.059
sum(significant[,4])/repetitions      # pre-specified effect power for var1X2
[1] 0.606
sum(significant[,5])/repetitions      # pre-specified effect power for var12X2
[1] 0.913
sum(significant[,7])/repetitions      # power for likelihood ratio test of model
[1] 1
sum(significant[,6]==5)/repetitions   # all effects power
[1] 0.005
sum(significant[,6]>0)/repetitions    # any effect power
[1] 0.96
sum(significant[,4]&significant[,5])/repetitions   # power for interaction terms
[1] 0.606

We can see from this that the magnitude of your effects varies considerably, and thus your ability to detect them varies. For example, the effect of $var1^2$ is particularly difficult to detect, only being significant 6% of the time even with half a million letters. On the other hand, the model as a whole was always significantly better than the null model. The other possibilities are arrayed in between. Although most of the 'data' are thrown away on each iteration, a good bit of exploration is still possible. For example, we could use the significant matrix to assess the correlations between the probabilities of different variables being significant.

I should note in conclusion, that due to the complexity and large $N$ entailed in your situation, this was not as simple as I had suspected / claimed in my initial comment. However, you can certainly get the idea for how this can be done in general, and the issues involved in power analysis, from what I've put here. HTH.

Simulating Data with Binary outcome

Fitting 1000 models with covariate

Fitting 1000 models without covariate

power.prop.test() results

Best Answer

Related Solutions

Power Analysis for Ordinal Logistic Regression

Statistical Power – Simulation of Logistic Regression Power Analysis in Designed Experiments

Preliminaries:

Simple post-hoc power for logistic regression in R:

Simulation-based a-priori power for logistic regression:

Related Question