Solved – Simulating responses from a factorial experiment for power analysis

generalized linear modelrsimulationstatistical-power

I am thinking about a factorial experiment with two factors. Both factors are ordered factors. Factor 1 has two levels: small and large. Factor 2 has four levels: never, sometimes, frequently, and often. I also want to conduct the experiment in a number of locations, so I will include location as a sort of "block." I expect larger responses for increasing levels of both factors, and I expect an interaction effect, too. Thus, I have a model as follows: Response ~ Block + Factor1*Factor2 + error, which will have at least 40 observations, maybe 80, maybe 120, or so on until I can detect an effect.

I'll be measuring a number of response variables, most of which will be counts or 0 truncated (latency of response). I'm wondering how to simulate responses from my model with the expectation of a moderate effect size. I want to know what sample size is appropriate to detect a moderate effect from my treatments, but I'm not familiar enough with simulation to know where to start with such a problem. Any advice or direction or requests for more information would be much appreciated.

Additional information: I'm using R to do everything.

EDIT:
I implemented Mark T Patterson's answer to my question modifying it to fit my particular experimental setup and attempt to simulate poisson data, but I get warnings when I run the function. Fortunately, there are some relevant answers on CrossValidate: Generate data samples from Poisson regression. I'll keep learning how to simulate other data to match the other kinds of response variables I'll be measuring.

Best Answer

Here are a few ideas to get you started --

Simulation usually has two parts: we'll want a function to generate sample data, and then a function to analyze the results of our simulation.

This setup has a lot of flexibility -- you can (and should) modify the code to match the causal relationships you expect to find.

Here's an example of a function to generate data for a continuous outcome variable:

# a single draw of simulated data will have n observations
# we'll replicate this B times:
data.gen = function(n, B){

# before generating simulated data, make an empty matrix to 
# hold the p-values we're going to keep track of:
p.vals = matrix(rep(NA,B*3),ncol = 3)  

# we want to replicate the process B times:
for(i in 1:B){  

# for the setup I have, I'm assuming 4 evenly sized blocks
# this function ensures n is a multiple of 4:    
stopifnot(n%%4 == 0)  

# creating the sample data (independent vars) for a single draw:
block  = factor(rep(1:4, each = n/4))
fact.1 = rbinom(n,1,.5)
fact.2 = sample(0:3, n, replace = TRUE)
error  = rnorm(n,0,3)

# create a dataframe:
df = data.frame(block, fact.1,fact.2, error)

# code the block factors (there's probably a better way to do this)
df$block.1 = as.numeric(df$block == 1)
df$block.2 = as.numeric(df$block == 2)
df$block.3 = as.numeric(df$block == 3)
df$block.4 = as.numeric(df$block == 4)

# specify the true relationship between your dv and your regressors:
# note: my choices here were entirely arbitrary.. you will definitely
# want to change these:

# block variable coefficients:   
b.1 = 0.5
b.2 = -.5
b.3 = -1
b.4 = 0.5

# factor variable coefficients:
b.f1 = 3
b.f2 = 4

# interaction:
b.f1f2 = 2


# specifying the true relationship between your regressors and your DV:
df$y = with(df,block.1*b.1 + block.2*b.2 + block.3*b.3 + block.4*b.4 +
              b.f1*fact.1 + b.f2*fact.2 + b.f1f2*fact.1*fact.2 + error)


# fit a model:
lm.1 = lm(y ~ block + fact.1*fact.2, data = df)


# save the p-values from the regression in the matrix you created:
p.vals[i,] = as.vector(summary(lm.1)$coefficients[3:5,4])

}

# clean up the data a bit -- 
p.vals = data.frame(p.vals)
names(p.vals) = c("fact.1","fact.2","int")

# return the p-values:
return(p.vals)

}

Now, we're ready to run the simulation:

# running an experiment with n = 80, B = 1000 takes about 5 seconds:
sim.dat = data.gen(80,1000)

Finally, we can write whatever functions we want to check the power -- here, I just report the proportion of experiments that result in a p-value (for each factor, and the interaction term) less than 0.5:

sum(sim.dat$fact.1<.05)/length(sim.dat$fact.1)
sum(sim.dat$fact.2<.05)/length(sim.dat$fact.2)
sum(sim.dat$int<.05)/length(sim.dat$int)

The setup I've created doesn't capture the count data you're interested in.. if you'd like to build in that feature, start by modifying the df$y bit of the code. Also, you may want entirely different coefficients, or to test a different model entirely. Finally, rather than reporting the proportion of significant results, you may want to consider plotting the coefficients or p-values.

Hope this gets you started!

Related Question