Solved – Simulating responses from a factorial experiment for power analysis

generalized linear model, r, simulation, statistical-power

I am thinking about a factorial experiment with two factors. Both factors are ordered factors. Factor 1 has two levels: small and large. Factor 2 has four levels: never, sometimes, frequently, and often. I also want to conduct the experiment in a number of locations, so I will include location as a sort of "block." I expect larger responses for increasing levels of both factors, and I expect an interaction effect, too. Thus, I have a model as follows: Response ~ Block + Factor1*Factor2 + error, which will have at least 40 observations, maybe 80, maybe 120, or so on until I can detect an effect.

I'll be measuring a number of response variables, most of which will be counts or 0 truncated (latency of response). I'm wondering how to simulate responses from my model with the expectation of a moderate effect size. I want to know what sample size is appropriate to detect a moderate effect from my treatments, but I'm not familiar enough with simulation to know where to start with such a problem. Any advice or direction or requests for more information would be much appreciated.

Additional information: I'm using R to do everything.

EDIT:
I implemented Mark T Patterson's answer to my question, modifying it to fit my particular experimental setup and to simulate Poisson data, but I get warnings when I run the function. Fortunately, there are some relevant answers on Cross Validated: "Generate data samples from Poisson regression". I'll keep learning how to simulate other data to match the other kinds of response variables I'll be measuring.

Best Answer

Here are a few ideas to get you started --

Simulation usually has two parts: we'll want a function to generate sample data, and then a function to analyze the results of our simulation.

This setup has a lot of flexibility -- you can (and should) modify the code to match the causal relationships you expect to find.

Here's an example of a function to generate data for a continuous outcome variable:

# a single draw of simulated data will have n observations
# we'll replicate this B times:
data.gen = function(n, B){

# before generating simulated data, make an empty matrix to 
# hold the p-values we're going to keep track of:
p.vals = matrix(rep(NA,B*3),ncol = 3)  

# we want to replicate the process B times:
for(i in 1:B){  

# for the setup I have, I'm assuming 4 evenly sized blocks
# this check stops with an error unless n is a multiple of 4:
stopifnot(n%%4 == 0)  

# creating the sample data (independent vars) for a single draw:
block  = factor(rep(1:4, each = n/4))
fact.1 = rbinom(n,1,.5)
fact.2 = sample(0:3, n, replace = TRUE)
error  = rnorm(n,0,3)

# create a dataframe:
df = data.frame(block, fact.1,fact.2, error)

# code the block factors (there's probably a better way to do this)
df$block.1 = as.numeric(df$block == 1)
df$block.2 = as.numeric(df$block == 2)
df$block.3 = as.numeric(df$block == 3)
df$block.4 = as.numeric(df$block == 4)

# specify the true relationship between your dv and your regressors:
# note: my choices here were entirely arbitrary.. you will definitely
# want to change these:

# block variable coefficients:   
b.1 = 0.5
b.2 = -.5
b.3 = -1
b.4 = 0.5

# factor variable coefficients:
b.f1 = 3
b.f2 = 4

# interaction:
b.f1f2 = 2


# specifying the true relationship between your regressors and your DV:
df$y = with(df,block.1*b.1 + block.2*b.2 + block.3*b.3 + block.4*b.4 +
              b.f1*fact.1 + b.f2*fact.2 + b.f1f2*fact.1*fact.2 + error)


# fit a model:
lm.1 = lm(y ~ block + fact.1*fact.2, data = df)


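# save the p-values for fact.1, fact.2 and their interaction.
# block is a 4-level factor, so rows 2:4 of the coefficient table are
# block contrasts and the terms we want are rows 5:7; indexing by name
# avoids picking up block rows by mistake:
p.vals[i,] = as.vector(summary(lm.1)$coefficients[c("fact.1","fact.2","fact.1:fact.2"),4])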

}

# clean up the data a bit -- 
p.vals = data.frame(p.vals)
names(p.vals) = c("fact.1","fact.2","int")

# return the p-values:
return(p.vals)

}

Now, we're ready to run the simulation:

# running an experiment with n = 80, B = 1000 takes about 5 seconds:
sim.dat = data.gen(80,1000)

Finally, we can write whatever functions we want to check the power -- here, I just report the proportion of experiments that result in a p-value (for each factor, and the interaction term) less than 0.05:

sum(sim.dat$fact.1<.05)/length(sim.dat$fact.1)
sum(sim.dat$fact.2<.05)/length(sim.dat$fact.2)
sum(sim.dat$int<.05)/length(sim.dat$int)
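
Since the question mentions trying 40, 80, or 120 observations, the same idea can be repeated over several sample sizes. Here is a minimal sketch of that, reusing the arbitrary coefficients from above. The names one.run and power.at are made up for illustration, and B = 200 is only there to keep the run short; you would want more replicates for a stable estimate.

```r
# Hypothetical helper: simulate one experiment of size n and return the
# p-values for fact.1, fact.2 and their interaction.
# Coefficients and error SD are the same arbitrary values used above.
one.run = function(n) {
  stopifnot(n %% 4 == 0)
  block  = factor(rep(1:4, each = n / 4))
  fact.1 = rbinom(n, 1, 0.5)
  fact.2 = sample(0:3, n, replace = TRUE)
  block.eff = c(0.5, -0.5, -1, 0.5)[as.integer(block)]
  y = block.eff + 3 * fact.1 + 4 * fact.2 + 2 * fact.1 * fact.2 +
      rnorm(n, 0, 3)
  fit = lm(y ~ block + fact.1 * fact.2)
  summary(fit)$coefficients[c("fact.1", "fact.2", "fact.1:fact.2"), "Pr(>|t|)"]
}

# Hypothetical helper: estimated power at one sample size, i.e. the
# proportion of B simulated experiments with p < alpha for each term.
power.at = function(n, B = 200, alpha = 0.05) {
  p = replicate(B, one.run(n))   # 3 x B matrix of p-values
  rowMeans(p < alpha)
}

set.seed(1)
sizes = c(40, 80, 120)
power.tab = sapply(sizes, power.at)   # rows = terms, columns = sample sizes
colnames(power.tab) = paste0("n=", sizes)
print(round(power.tab, 2))
```

Each column of power.tab is the estimated power at one sample size, so you can read off the smallest n that reaches whatever target (say 0.8) you have in mind for the term you care about.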

The setup I've created doesn't capture the count data you're interested in. If you'd like to build in that feature, start by modifying the df$y bit of the code, and change the fitted model to match. Also, you may want entirely different coefficients, or to test a different model entirely. Finally, rather than reporting the proportion of significant results, you may want to consider plotting the coefficients or p-values.
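
For count responses, one possible modification (a sketch, not the only way to do it) is to build the linear predictor on the log scale, draw the counts with rpois(), and fit a Poisson GLM instead of lm(). The function name sim.pois and every coefficient below are hypothetical; the coefficients are on the log scale, so they need to be chosen to reflect the effect sizes you actually expect.

```r
# Sketch: same design as above, but y is a Poisson count with a log link.
# All coefficient values are arbitrary placeholders.
sim.pois = function(n, B, alpha = 0.05) {
  stopifnot(n %% 4 == 0)
  terms  = c("fact.1", "fact.2", "fact.1:fact.2")
  p.vals = matrix(NA_real_, nrow = B, ncol = 3, dimnames = list(NULL, terms))

  for (i in seq_len(B)) {
    block  = factor(rep(1:4, each = n / 4))
    fact.1 = rbinom(n, 1, 0.5)
    fact.2 = sample(0:3, n, replace = TRUE)
    block.eff = c(0, 0.1, -0.1, 0.2)[as.integer(block)]

    # linear predictor on the log scale, then the expected count:
    eta = 0.5 + block.eff + 0.3 * fact.1 + 0.2 * fact.2 +
          0.1 * fact.1 * fact.2
    mu  = exp(eta)

    # the randomness comes from rpois(); no separate error term is added
    y = rpois(n, lambda = mu)

    fit = glm(y ~ block + fact.1 * fact.2, family = poisson(link = "log"))
    p.vals[i, ] = summary(fit)$coefficients[terms, "Pr(>|z|)"]
  }

  # proportion of simulated experiments with p < alpha, per term:
  colMeans(p.vals < alpha)
}

set.seed(42)
pois.power = sim.pois(n = 80, B = 200)
print(pois.power)
```

Note that the mean passed to rpois() has to be positive, which is why the coefficients enter through exp() rather than being added to y directly.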

Hope this gets you started!

Related Solutions

Solved – Nested mixed effects anova with lmer

Answer Q1: To know if the model is correct we need more information. It depends on how you randomized the TreatN and TreatH combinations in each site (i.e. FactorS). If you assigned the TreatN and TreatH combinations as a completely randomized design, then I would say that yes, your model is correct. If you randomized the treatment combinations as a randomized complete block design, I think the model should be:

m2 = lmer(area ~ treatN*treatH + (1|FactorS/replicate), 
          data = data)

because your replicate/block is nested within location.

Answer Q2: lmer() fits linear mixed-effects models, which are the special case of a generalized linear mixed model with a Gaussian distribution and identity link. The function glm() can't fit random effects.

Answer Q3: You can't fit the model you specified using the glm() function unless you treat FactorS as a fixed-effect; you could use glmer() function doing the following:

m4 = glmer(area ~ treatN*treatH + (1|FactorS), 
           data = data, family = Gamma(link = "identity"))

If your data follow a normal distribution you can also use the Gaussian distribution in glmer, which is the same as doing the analysis in lmer.

m4 = glmer(area ~ treatN*treatH + (1|FactorS), 
           data = data, family = gaussian(link = "identity"))

(this will give you a warning saying you should just use lmer instead).

Here are a few links that can help you decide what distribution to use in your analysis:
When to use gamma GLMs?

Solved – Constructing 3-way ANOVA when design is not fully factorial

You are correct that, since not all combinations appear in your data set, you will not be able to test certain interactions and contrasts. I am afraid there is no way round this with the constraints you had.

Do you really want to add the three-way interaction to your model? It is up to you, but higher-order interactions can be hard to interpret, especially when you have missing cells in the design. I would have included the two-way interactions only; (A+B+Block)^2 should do it. Note that if you fit a model with just the main effects (A + B + Block) there is still one effect which cannot be estimated: A3 and B3 are aliased, because they only ever occur together and so cannot be separated. Moving on to including the two-way interactions, we find more problems. There is only one estimable effect for the A:B interaction, because none of A2B3, A3B2 and A3B3 are separately estimable given the way they co-occur. There are also problems with the interactions with Block.
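
The aliasing of A3 and B3 can be seen in a small made-up example. The design below is hypothetical (it is not the asker's data): it just has level 3 of A occurring only together with level 3 of B, and a pure-noise response. In that situation lm() cannot separate the two effects and reports NA for one of them.

```r
# Hypothetical unbalanced design: A = 3 occurs only with B = 3 (and vice versa).
set.seed(1)
cells = data.frame(A = c(1, 1, 2, 2, 3), B = c(1, 2, 1, 2, 3))
d = cells[rep(seq_len(nrow(cells)), each = 4), ]   # 4 replicates per cell
d$Block = factor(rep(1:4, times = nrow(cells)))    # one replicate per block
d$A = factor(d$A)
d$B = factor(d$B)
d$y = rnorm(nrow(d))                               # noise only, for illustration

fit = lm(y ~ A + B + Block, data = d)

# The dummy columns for A3 and B3 are identical, so one coefficient
# (here B3, the later one in the formula) is not estimable and comes back NA:
print(coef(fit))

# alias() reports which term is a linear combination of the others:
print(alias(fit)$Complete)
```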

You should not try to remove the main effect of Block. Search for the concept of marginality for explanations of why, or look at the Q&A "Including the interaction but not the main effects in a model" for some discussion.
