Solved – How to be sure the sample size is large enough for conditional logistic regression

Tags: logistic, r, statistical-power

I am performing a conditional logistic regression (CLR) case-control (1:3) study over a very large dataset that has been partitioned according to a set of study requirements (described below). Each entry is a patient record: age, gender, up to nine disease entries, etc.

I take the male 20-29 population and divide it into those with illness A (cases) and those without illness A (four controls per case). I then use CLR to calculate the odds of illness A given a second illness B. For large sample sizes I am seeing sensible odds ratios with tight CIs, e.g. for headache (illness A) and hypertension (illness B) an OR of around 1.8, which sits well with the current literature, i.e. your odds of having a headache almost double if you have hypertension.

However, if I replace headache (illness A) with a rare illness, my sample size for males 20-29 with illness A drops through the floor, e.g. I might only have 30 records. The number of cases that have both the rare illness A and illness B is then very small (1 to 3 in some instances), causing clogit to return Inf values, or very large odds ratios with very wide CIs.

I was thinking of using a power analysis to determine how large my sample size should be, and in that way rule out any scenarios where the set is too small. Firstly, would this be the correct approach to take? And if so, how do I calculate a power analysis for conditional logistic regression? I believe that to determine a sample size I need to know my effect size (the odds ratio I want to detect), which requires knowing prior probabilities?

Secondly, given that I manually partition my population by gender and then into age groups on the assumption that everyone in a group can be treated the same, should I even bother matching for age within the age range in my case-control design? E.g. should I match a 22-year-old with another 22-year-old when I am already assuming that a 28-year-old shouldn't be treated differently?

I have compared matching for age within the age group against not matching for age (both case and control populations are always between 20-29); the odds ratios differ by roughly 0.1:

Matching for age:

                     coef   exp(coef) se(coef)    z       p
hypertension_UNTRUE 0.845     2.327    0.150 5.63 1.8e-08

Likelihood ratio test=29.9  on 1 df, p=4.56e-08
n= 6633, number of events= 1670
                       2.5 % 97.5 %
hypertension_UNTRUE 1.734574 3.1224

Not matching for age but case and control are still always between 20-29:

                      coef exp(coef) se(coef)    z       p
hypertension_UNTRUE 0.895     2.448    0.154 5.81 6.1e-09

Likelihood ratio test=32.1  on 1 df, p=1.48e-08
n= 6615, number of events= 1670
                       2.5 %   97.5 %
hypertension_UNTRUE 1.810327 3.311272

Best Answer

Though I am generally familiar with the technique and its utility for analyses with stratified and matched samples, I have not used conditional logistic regression models in my own work. What I can give you is an example of a simulation-based power analysis approach in R that demonstrates how to build uncertainty about parameter estimates and features of the data into your power analysis. Hopefully this provides enough of a framework that you can tweak it for your needs.

To simplify, let's say that I am interested in examining the relation between a single binary risk variable, $x_1$, and the probability that $y=1$. And I would like to control for scores on the continuous variable $x_2$ when testing my hypothesis about the relation between $x_1$ and $y$.

So essentially my logistic regression model of interest looks something like the following (there is more than one way to represent this model of course):

$$P(y_i=1)=\frac{1}{1+e^{-(\beta_0+\beta_1x_{1i}+\beta_2x_{2i}+r_i)}}$$

I am most interested in my power to detect a significant effect for my estimate of $\beta_1$ as a function of sample size:

sample.N<-seq(20, 250, by=10)

However, I am a bit uncertain about a number of features that could influence my statistical power.

Oftentimes researchers ignore various sources of uncertainty when performing a power analysis. I happen to think we can do better, though it does take more thought. To start, we can probably come up with some reasonable estimates for something like a base prevalence rate for our risk group ($x_1$) variable.

I am going to begin by assuming a relatively low risk factor rate of 35% (i.e., $P(x_1=1)=.35$). But I am not 100% confident of that estimate, so I build some uncertainty into my power analysis. Specifically, I am going to use a $Beta(35, 65)$ distribution to draw estimated values for $P(x_1=1)$.

prob.x1<-rbeta(1, 35, 65)
x1<-rbinom(n=sample.N[n], size = 1, prob = prob.x1)

I am also going to assume that there is some association between $x_1$ group membership and $x_2$ scores. To build this feature into my data I use a standardized difference score of .3 (which represents a small effect). I am, however, a little uncertain as to how good of an estimate that .3 is. So, I use a beta distribution to allow my analysis to consider $d$ values that are smaller than .3 (but still positive) and slightly larger (ranging up toward a moderate effect).

d.x1.x2<-rbeta(1, 15, 35)
x2<-rep(NA, sample.N[n])
x2[x1==1]<-rnorm(length(x2[x1==1]), mean = d.x1.x2) 
x2[x1==0]<-rnorm(length(x2[x1==0]), mean = 0)

I may have some general ideas about the coefficients I might expect to see in my model above. To start, $\hat{\beta_0}$ is the predicted log of the odds that $y=1$ when $x_1=0$ and $x_2=0$. It is difficult to think in terms of log odds, so instead I first specify an estimated value for $P(y=1|x_1=0, x_2=0)$, which again I am not 100% certain about. I use a $Beta(25, 75)$ distribution for which $E[Beta(25,75)]=.25$, which means that I generally believe that the probability that $y=1$ is .25 when $x_1=0$ and $x_2$ = 0.

p.y<-rbeta(1, 25, 75)
b0<-log(p.y/(1-p.y)) #then convert that probability into logits

As for $\hat{\beta_1}$ and $\hat{\beta_2}$, I will posit conditional odds ratios of 2.5 and 1.75 respectively, meaning that to allow for uncertainty I will draw from the normal distributions $\hat{\beta_1} \sim N(\ln(2.5), .5)$ and $\hat{\beta_2} \sim N(\ln(1.75), .5)$.

b1<-rnorm(n=1, mean=log(2.5), sd=.5)
b2<-rnorm(n=1, mean=log(1.75), sd=.5)

And finally, I should consider that my model is unlikely to fit the data perfectly. I know that my model residuals should have a mean of 0 and some variance, I am just uncertain about how much variance. So again, I build that uncertainty into the model by drawing standard deviations for my simulated data's residuals from a gamma distribution.

sd.e<-sqrt(rgamma(1, shape = 12, scale = 1/12)) 
e<-rnorm(n=sample.N[n], mean = 0, sd=sd.e)

Now I need to put this all together to create a vector $y$.

y<-ifelse(1/(1+exp(-(b0+b1*x1+b2*x2+e)))>=.50,1,0)
fit<-glm(y~x1+x2, family = 'binomial')

And I can save the $p$ value for my estimate of $\beta_1$ from the model.

p.val<-coef(summary(fit))[2,4]

Putting it all together in a single place with some annotation we get (note it will take a few minutes to run):

library(testit)
sample.N<-seq(20, 250, by=10)
n.sims<-10000
DF<-data.frame()
for(n in 1:length(sample.N)){
  power.vec<-vector()
  for(s in 1:n.sims){
    #Allow for uncertainty around my population estimate of P(x=1)
    #The mean of a beta(35,65) distribution is .35 (i.e., 35/(35+65))
    prob.x1<-rbeta(1, 35, 65)

    #------------------------------------------------------------------------------

    #Draw a random sample of size sample.N[n] for x1 based on population probability
    x1<-rbinom(n=sample.N[n], size = 1, prob = prob.x1)

    #------------------------------------------------------------------------------

    #Specify standardized difference on x2 as a function of x1
    #I assume a small effect size of d = .3, but allow that I may be wrong about it
    #Again I am uncertain about this value so I give it a distribution:
    d.x1.x2<-rbeta(1, 15, 35)

    #------------------------------------------------------------------------------

    #And to use that difference to draw from two populations that have a standardized difference of .3:
    x2<-rep(NA, sample.N[n])
    x2[x1==1]<-rnorm(length(x2[x1==1]), mean = d.x1.x2) 
    x2[x1==0]<-rnorm(length(x2[x1==0]), mean = 0)

    #------------------------------------------------------------------------------
    #Now I specify estimates for b0, b1, and b2
    #b0 is a little tricky: it corresponds to P(y=1) when x1=0 and x2=0
    #I start by specifying P(y=1)=.25 with some uncertainty
    p.y<-rbeta(1, 25, 75)
    b0<-log(p.y/(1-p.y)) #then convert that probability into logits

    #for b1 and b2 I assume a normal distribution for each estimate
    #further I assume a larger coefficient for x1 (OR = 2.5) than for x2 (OR = 1.75)
    b1<-rnorm(n=1, mean=log(2.5), sd=.5)
    b2<-rnorm(n=1, mean=log(1.75), sd=.5)

    #------------------------------------------------------------------------------

    #Finally I need to consider that my model will have some error here
    #I know that I am going to have some amount of error... but again not sure a specific value
    #So I choose to specify a standard deviation of my residuals using a gamma distribution
    sd.e<-sqrt(rgamma(1, shape = 12, scale = 1/12))
    #And now I can generate a residual for each case:
    e<-rnorm(n=sample.N[n], mean = 0, sd=sd.e)
    y<-ifelse(1/(1+exp(-(b0+b1*x1+b2*x2+e)))>=.50,1,0)
    fit<-glm(y~x1+x2, family = 'binomial')
    #discard simulations where the fit raised a warning (e.g., fitted probabilities of 0 or 1)
    if(has_warning(glm(y~x1+x2, family = 'binomial'))){
      power.vec[s]<-NA
    }
    else{
      #get the p-value for b1 and record whether it is significant at alpha = .05
      p.val<-coef(summary(fit))[2,4]
      power.vec[s]<-ifelse(p.val<.05, 1, 0)
    }
  }
  temp.vec<-c(sample.N[n], mean(power.vec, na.rm = T), length(na.omit(power.vec)))
  DF<-rbind(DF, temp.vec)
}
colnames(DF)<-c('N', 'Power', 'Sim_success')
DF

Which returns:

     N     Power Sim_success
1   20 0.1121205        9026
2   30 0.2880463        9679
3   40 0.4113418        9875
4   50 0.5090398        9956
5   60 0.5645420        9978
6   70 0.5986381        9986
7   80 0.6480480        9990
8   90 0.6685011        9994
9  100 0.6968697        9999
10 110 0.7133280        9994
11 120 0.7343734        9999
12 130 0.7521000       10000
13 140 0.7569000       10000
14 150 0.7685000       10000
15 160 0.7843569        9998
16 170 0.7768777        9999
17 180 0.8012000       10000
18 190 0.8066000       10000
19 200 0.8130000       10000
20 210 0.8247000       10000
21 220 0.8196820        9999
22 230 0.8277000       10000
23 240 0.8361000       10000
24 250 0.8484848        9999

According to this analysis it would appear that a sample of around 180 would yield a power of .8 to detect a significant effect for my estimate of $\beta_1$.
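
If you want a single number rather than eyeballing the table, one quick option (my addition here, assuming the DF object produced by the loop above) is to linearly interpolate the Monte Carlo estimates:

#Rough linear interpolation of the simulated power estimates to find
#the sample size at which power reaches .80 (uses DF from above)
approx(x = DF$Power, y = DF$N, xout = .80)$y

With the estimates shown above this lands close to the 180 read off the table, though it will wobble a little from run to run since the power values are themselves Monte Carlo estimates.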

I can also plot this power curve using the code below:

library(ggplot2)
g1<-ggplot(aes(x=N, y=Power), data = DF)
g2<-g1+stat_smooth(se=F, method = 'loess')+geom_hline(yintercept = .8, lty='dashed', color='red')
g3<-g2+xlab('Sample Size')+ylab('Power')
g4<-g3+ggtitle('Power Curve')
g4

which results in a loess-smoothed power curve (Power against Sample Size) with a dashed reference line at power = .8.

Now I understand that this analysis is not 100% in line with your problem, but I believe the simulated data can be tweaked to match your analysis and data. Note that I have made a lot of non-trivial decisions in specifying just how uncertain I am about certain features of the data and estimates for the model. These decisions should be based on your knowledge of the subject matter and existing literature, whereas my choices were relatively arbitrary.
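
To bring it a bit closer to your setting, below is a minimal sketch (my own addition, not part of the analysis above) of the same simulate-and-refit idea applied to a 1:4 matched case-control design analysed with clogit from the survival package. The control exposure prevalence, the assumed odds ratio, and the uncertainty distributions around them are placeholders you would replace with values from your data or the literature. Power is estimated as the proportion of simulated datasets in which the exposure coefficient is significant and the model fits without warnings.

library(survival)

n.sims     <- 2000
n.cases    <- seq(10, 100, by = 10)  #number of matched sets (i.e., number of cases)
m.controls <- 4                      #controls per case; set this to match your design

power.clr <- sapply(n.cases, function(k){
  sig <- replicate(n.sims, {
    #uncertainty about the exposure (illness B) prevalence among controls, mean .15
    p0 <- rbeta(1, 15, 85)
    #uncertainty about the conditional odds ratio for illness B, centred on 2
    OR <- exp(rnorm(1, mean = log(2), sd = .3))
    #case exposure probability implied by that odds ratio
    p1 <- OR * p0 / (1 + (OR - 1) * p0)

    #build k matched sets of 1 case + m.controls controls
    dat <- data.frame(set  = rep(seq_len(k), each = m.controls + 1),
                      case = rep(c(1, rep(0, m.controls)), times = k))
    dat$expo <- rbinom(nrow(dat), 1, ifelse(dat$case == 1, p1, p0))

    #fit the conditional logistic regression; treat warnings (e.g., separation) as failed fits
    fit <- tryCatch(clogit(case ~ expo + strata(set), data = dat),
                    warning = function(w) NULL, error = function(e) NULL)
    if (is.null(fit)) NA else summary(fit)$coefficients["expo", "Pr(>|z|)"] < .05
  })
  mean(sig, na.rm = TRUE)
})

data.frame(cases = n.cases, power = power.clr)

As with the logistic example above, sample sizes at which a large share of the simulations fail to fit cleanly are themselves a signal that the design is too small for a rare illness A, which is essentially the Inf/huge-CI behaviour you are already seeing.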