Statistical Power – Simulation of Logistic Regression Power Analysis in Designed Experiments

generalized-linear-model, logistic, r, simulation, statistical-power

This question is in response to an answer given by @Greg Snow regarding a question I asked about power analysis with logistic regression and SAS Proc GLMPOWER.

If I am designing an experiment and will analyze the results with a factorial logistic regression, how can I use simulation to conduct a power analysis?

Here is a simple example where there are two variables: the first takes on three possible values {0.03, 0.06, 0.09} and the second is a dummy indicator {0, 1}. For each combination we estimate the response rate (# of responders / # of people marketed to). Further, we wish to have 3 times as many of the first combination of factors as of the others (which can be considered equal), because this first combination is our tried-and-true version. This is a setup like the one given in the SAS course mentioned in the linked question.

Var1   Var2   Response rate   Weight
0.03   0      0.0025          3
0.03   1      0.00395         1
0.06   0      0.003           1
0.06   1      0.0042          1
0.09   0      0.0035          1
0.09   1      0.002           1

The model that will be used to analyze the results will be a logistic regression, with main effects and interaction (response is 0 or 1).

mod <- glm(response ~ Var1 + Var2 + I(Var1*Var2), family = binomial(link = "logit"))

How can I simulate a data set to use with this model to conduct a power analysis?

When I run this through SAS Proc GLMPOWER (using STDDEV = 0.05486016, which corresponds to sqrt(p(1-p)), where p is the weighted average of the shown response rates):

data exemplar;
  input Var1 $ Var2 $ response weight;
  datalines;
3 0 0.0025  3
3 1 0.00395 1
6 0 0.003   1
6 1 0.0042  1
9 0 0.0035  1
9 1 0.002   1
;
run;

proc glmpower data=exemplar;
  weight weight;
  class Var1 Var2;
  model response = Var1 | Var2;
  power
    power=0.8
    ntotal=.
    stddev=0.05486016;
run;

Note: GLMPOWER will only use class (nominal) variables, so 3, 6, 9 above are treated as character values and could have been low, mid, and high or any other three strings. When the real analysis is conducted, Var1 will be used as numeric (and we will include a polynomial term Var1*Var1 to account for any curvature).

The output from SAS is

[Image: Proc GLMPOWER output table of the computed total sample size for each effect]

So we see that we need a total sample size of 762,112 (the Var2 main effect is the hardest to estimate) with power equal to 0.80 and alpha equal to 0.05. We would allocate these so that 3 times as many go to the baseline combination (i.e., 0.375 * 762,112) and the remainder fall equally into the other 5 combinations.
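For reference, a quick check of that allocation in R, assuming the 3:1:1:1:1:1 weighting described above:

# Per-cell allocation of the total N, assuming the 3:1:1:1:1:1 weighting
N.total <- 762112
weights <- c(3, 1, 1, 1, 1, 1)           # baseline combination gets 3x the others
round(N.total * weights / sum(weights))  # baseline share = 3/8 = 0.375 of the total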

Best Answer

Preliminaries:

  • As discussed in the G*Power manual, there are several different types of power analyses, depending on what you want to solve for. (That is, $N$, the effect size $ES$, $\alpha$, and power exist in relation to each other; specifying any three of them will let you solve for the fourth.)

    • In your description, you want to know the appropriate $N$ to capture the response rates you specified with $\alpha=.05$ and power = 80%. This is a-priori power.
    • We can start with post-hoc power (determining power given $N$, response rates, & alpha), as this is conceptually simpler, and then move on to a-priori power.
  • In addition to @GregSnow's excellent post, another really great guide to simulation-based power analyses on CV can be found here: Calculating statistical power. To summarize the basic ideas:

    1. figure out the effect you want to be able to detect
    2. generate N data from that possible world
    3. run the analysis you intend to conduct over those faux data
    4. store whether the results are 'significant' according to your chosen alpha
    5. repeat many ($B$) times & use the % 'significant' as an estimate of (post-hoc) power at that $N$
    6. to determine a-priori power, search over possible $N$'s to find the value that yields your desired power
  • Whether you will find significance on a particular iteration can be understood as the outcome of a Bernoulli trial with probability $p$ (where $p$ is the power). The proportion found over $B$ iterations allows us to approximate the true $p$. To get a better approximation, we can increase $B$, although this will also make the simulation take longer.

  • In R, the primary way to generate binary data with a given probability of 'success' is ?rbinom

    • E.g., to get the outcomes of 10 Bernoulli trials, each with probability p of 'success', the code would be rbinom(n=10, size=1, prob=p); to get the number of successes out of 10 trials as a single count, use rbinom(n=1, size=10, prob=p). (You will probably want to assign the result to a variable for storage.)
    • you can also generate such data less elegantly by using ?runif, e.g., ifelse(runif(1)<=p, 1, 0)
    • if you believe the responses are mediated by a latent Gaussian variable, you could generate the latent variable as a function of your covariates with ?rnorm, convert it into probabilities with pnorm(), and use those in your rbinom() call (a short sketch of all three approaches appears after this list)
  • You state that you will "include a polynomial term Var1*Var1 to account for any curvature". There is a confusion here: polynomial terms do account for curvature, but Var1*Var1 is interaction notation--in an R model formula it will not give you a squared term (for that you need I(Var1^2)). Nonetheless, your response rates require us to include both squared terms and interaction terms in our model. Specifically, your model will need to include: $var1^2$, $var1*var2$, and $var1^2*var2$, beyond the basic terms.

  • Although written in the context of a different question, my answer here: Difference between logit and probit models has a lot of basic information about these types of models.
  • Just as there are different kinds of Type I error rates when there are multiple hypotheses (e.g., per-contrast error rate, familywise error rate, & per-family error rate), so are there different kinds of power (e.g., for a single pre-specified effect, for any effect, & for all effects). You could also seek the power to detect a specific combination of effects, or the power of a simultaneous test of the model as a whole. My guess from your description of your SAS code is that it is looking for the latter. However, from your description of your situation, I am assuming you want to detect the interaction effects at a minimum.

  • For a different way to think about issues related to power, see my answer here: How to report general precision in estimating correlations within a context of justifying sample size.
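To make the data-generation options mentioned above concrete, here is a minimal sketch of all three approaches; the probability p, the covariate x, and the coefficients are just placeholders:

set.seed(1)
p <- 0.3                            # placeholder probability of 'success'

# 1. Directly via rbinom(): 10 Bernoulli trials, each returned as 0/1
y1 <- rbinom(n=10, size=1, prob=p)

# 2. Less elegantly via runif(): compare uniform draws to p
y2 <- ifelse(runif(10) <= p, 1, 0)

# 3. Via a latent Gaussian variable: build the latent score from a covariate,
#    convert it to probabilities with pnorm(), then draw with rbinom()
x      <- rnorm(10)                 # placeholder covariate
latent <- -0.5 + 1.2*x              # placeholder coefficients
y3     <- rbinom(n=10, size=1, prob=pnorm(latent))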

Simple post-hoc power for logistic regression in R:

Let's say your posited response rates represent the true situation in the world, and that you had sent out 10,000 letters. What is the power to detect those effects? (Note that I am famous for writing "comically inefficient" code; the following is intended to be easy to follow rather than optimized for efficiency. In fact, it's quite slow.)

set.seed(1)

repetitions = 1000     # B: number of simulated data sets
N = 10000              # total number of letters sent
n = N/8                # number of times the 8-row design pattern is repeated
var1  = c(   .03,    .03,    .03,    .03,    .06,    .06,    .09,   .09)
var2  = c(     0,      0,      0,      1,      0,      1,      0,     1)
rates = c(0.0025, 0.0025, 0.0025, 0.00395, 0.003, 0.0042, 0.0035, 0.002)

var1    = rep(var1, times=n)
var2    = rep(var2, times=n)
var12   = var1**2      # squared term for var1
var1x2  = var1 *var2   # var1 x var2 interaction
var12x2 = var12*var2   # var1^2 x var2 interaction

significant = matrix(nrow=repetitions, ncol=7)

startT = proc.time()[3]
for(i in 1:repetitions){
  responses          = rbinom(n=N, size=1, prob=rates)   # simulate binary responses
  model              = glm(responses~var1+var2+var12+var1x2+var12x2, 
                           family=binomial(link="logit"))
  significant[i,1:5] = (summary(model)$coefficients[2:6,4]<.05)  # which terms have p < .05
  significant[i,6]   = sum(significant[i,1:5])                   # how many terms were significant
  modelDev           = model$null.deviance-model$deviance
  significant[i,7]   = (1-pchisq(modelDev, 5))<.05               # likelihood ratio test of the model
}
endT = proc.time()[3]
endT-startT            # elapsed time in seconds

sum(significant[,1])/repetitions      # pre-specified effect power for var1
[1] 0.042
sum(significant[,2])/repetitions      # pre-specified effect power for var2
[1] 0.017
sum(significant[,3])/repetitions      # pre-specified effect power for var12
[1] 0.035
sum(significant[,4])/repetitions      # pre-specified effect power for var1X2
[1] 0.019
sum(significant[,5])/repetitions      # pre-specified effect power for var12X2
[1] 0.022
sum(significant[,7])/repetitions      # power for likelihood ratio test of model
[1] 0.168
sum(significant[,6]==5)/repetitions   # all effects power
[1] 0.001
sum(significant[,6]>0)/repetitions    # any effect power
[1] 0.065
sum(significant[,4]&significant[,5])/repetitions   # power for interaction terms
[1] 0.017

So we see that 10,000 letters doesn't really achieve 80% power (of any sort) to detect these response rates. (I am not sufficiently sure about what the SAS code is doing to be able to explain the stark discrepancy between these approaches, but this code is conceptually straightforward--if slow--and I have spent some time checking it, and I think these results are reasonable.)
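Because each of the estimates above is just the proportion of repetitions Bernoulli outcomes (as noted in the preliminaries), you can put a Monte Carlo confidence interval around it with an ordinary binomial test; for example:

# Monte Carlo uncertainty of an estimated power: each estimate is a proportion
# of 'repetitions' binary outcomes, so a binomial CI applies
binom.test(sum(significant[,1]), repetitions)$conf.int   # CI around the var1 estimate (0.042)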

Simulation-based a-priori power for logistic regression:

From here the idea is simply to search over possible $N$'s until we find a value that yields the desired level of the type of power you are interested in. Any search strategy that you can code up to work with this would be fine (in theory). Given the $N$'s that are going to be required to capture such small effects, it is worth thinking about how to do this more efficiently. My typical approach is simply brute force, i.e. to assess each $N$ that I might reasonably consider. (Note however, that I would typically only consider a small range, and I'm typically working with very small $N$'s--at least compared to this.)
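To illustrate, here is a minimal sketch of that brute-force approach, with the simulation above wrapped in a hypothetical simPower() function; the choice of power criterion (both interaction terms significant), the candidate grid, and the reduced number of repetitions are arbitrary placeholders:

# Hypothetical wrapper around the simulation above: returns the estimated power
# of the type you care about (here, both interaction terms significant) for a given N
simPower <- function(N, repetitions=1000){
  n       = N/8
  var1    = rep(c(.03,.03,.03,.03,.06,.06,.09,.09), times=n)
  var2    = rep(c(  0,  0,  0,  1,  0,  1,  0,  1), times=n)
  rates   = c(0.0025,0.0025,0.0025,0.00395,0.003,0.0042,0.0035,0.002)
  var12   = var1^2
  var1x2  = var1*var2
  var12x2 = var12*var2
  hits    = logical(repetitions)
  for(i in 1:repetitions){
    responses = rbinom(n=N, size=1, prob=rates)
    model     = glm(responses~var1+var2+var12+var1x2+var12x2,
                    family=binomial(link="logit"))
    p         = summary(model)$coefficients[2:6,4]
    hits[i]   = all(p[4:5] < .05)    # e.g., both interaction terms significant
  }
  mean(hits)
}

# Brute-force search over candidate N's (multiples of 8 so the cells divide evenly);
# fewer repetitions per N to keep the run time manageable
candidateNs = seq(400000, 1200000, by=200000)
powers      = sapply(candidateNs, simPower, repetitions=200)
candidateNs[which(powers >= .80)[1]]   # smallest candidate N reaching 80% power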

Instead, my strategy here was to bracket possible $N$'s to get a sense of what the range of powers would be. Thus, I picked an $N$ of 500,000 and re-ran the code (initializing the same seed; n.b., this took an hour and a half to run). Here are the results:

sum(significant[,1])/repetitions      # pre-specified effect power for var1
[1] 0.115
sum(significant[,2])/repetitions      # pre-specified effect power for var2
[1] 0.091
sum(significant[,3])/repetitions      # pre-specified effect power for var12
[1] 0.059
sum(significant[,4])/repetitions      # pre-specified effect power for var1X2
[1] 0.606
sum(significant[,5])/repetitions      # pre-specified effect power for var12X2
[1] 0.913
sum(significant[,7])/repetitions      # power for likelihood ratio test of model
[1] 1
sum(significant[,6]==5)/repetitions   # all effects power
[1] 0.005
sum(significant[,6]>0)/repetitions    # any effect power
[1] 0.96
sum(significant[,4]&significant[,5])/repetitions   # power for interaction terms
[1] 0.606

We can see from this that the magnitude of your effects varies considerably, and thus your ability to detect them varies as well. For example, the effect of $var1^2$ is particularly difficult to detect, being significant only 6% of the time even with half a million letters. On the other hand, the model as a whole was always significantly better than the null model. The other possibilities are arrayed in between. Although most of the 'data' are thrown away on each iteration, a good bit of exploration is still possible. For example, we could use the significant matrix to assess how often the different terms tend to be significant together (see the snippet below).
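For instance, a quick way to do that exploration:

# Correlations between the per-term significance indicators (columns 1-5 of
# 'significant'), i.e., how often different terms are significant on the same iterations
cor(significant[,1:5])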

I should note, in conclusion, that due to the complexity and the large $N$ entailed in your situation, this was not as simple as I had suspected / claimed in my initial comment. However, you can certainly get the idea of how this can be done in general, and of the issues involved in power analysis, from what I've put here. HTH.