Solved – Mixed-effect model in R using lme for data count data with two fixed effects and repeated measures

lme4-nlmemixed modelrrepeated measures

I have no idea how to analyze this dataset.

I am asking if two genotypes, T and M, respond differently to a treatment, E2 (I also have a control, CON). All 36 animals were given both E2 and CON in a counterbalanced order and then their behavior was measured. The behavior was scored once every 30 seconds for 10 minutes as "yes" or "no" (coded "1" or "0").

I am interested in knowing if the treatment affects the genotypes differently and whether this effect changes over time.

I have tried running the ANOVA (see below) as a non-parametric test on my count data, but the data are not normally distributed and the results do not make sense.

model<-aov(behavior~genotype*treatment*time+Error(animal/(time*treatment)), data=dataset)

Therefore, I think that a mixed effects model is right. In this analysis, I think my fixed effects should be genotype and treatment. The random effects should be animal, treatment, and time. I also noted sex and age, but I'm not sure that matters for this analysis. The datafile is set up such that each line is a separate observation for each animal, every 30 seconds. See below:

>animal genotype sex age    treatment   time    behavior                                                                                
>1403   T   F   AHY CON 0   1                                                                               
>1404   T   F   AHY CON 0   1                                                                               
>1406   T   F   HY  CON 0   1                                                                               
>1407   T   F   AHY CON 0   1                                                                               
>1423   T   F   AHY CON 0   1                                                                               
>1425   T   F   HY  CON 0   1                                                                               
>1428   T   F   AHY CON 0   1                                                                               
>1431   T   F   AHY CON 0   1   

I have tried modeling the data using lme in R but I am not sure that I am nesting the random factors properly because the df for "genotype" is 28, but I only have 2 genotypes (so it should be df=1). This is my model:

mixed.model1 <- lme(fixed=behavior~genotype * treatment * time, 
random= ~ 1|animal/time/treatment, data=dataset)
summary(mixed.model1)

Linear mixed-effects model fit by REML
 Data: dataset 
       AIC      BIC    logLik
  2727.064 2779.666 -1351.532

Random effects:
 Formula: ~1 | animal
        (Intercept)
StdDev:    1.345537

 Formula: ~1 | time %in% animal
        (Intercept)
StdDev:   0.5004338

 Formula: ~1 | treatment %in% time %in% animal
        (Intercept)  Residual
StdDev:    2.030743 0.5704242

Fixed effects: behavior ~ genotype * treatment * time 
                           Value Std.Error  DF   t-value p-value
(Intercept)            2.1647059 0.4852943 296  4.460604  0.0000
genotypeT              0.3737557 0.7372150  28  0.506983  0.6161
treatmentE2           -0.3098039 0.4942422 296 -0.626826  0.5313
time                  -0.1465241 0.0578876 268 -2.531184  0.0119
genotypeT:treatmentE2  0.7969834 0.7508078 296  1.061501  0.2893
genotypeT:time        -0.1541752 0.0879375 268 -1.753236  0.0807
treatmentE2:time       0.0124777 0.0796543 296  0.156648  0.8756
genotypeT:treatmentE2:time -0.0367201 0.1210036 296 -0.303463  0.7618
 Correlation: 
                      (Intr) genoT treatE2 time gnW:E2 genoW: trtE2:
genoT                 -0.658                                          
treatE2               -0.509  0.335                                   
time                  -0.656  0.432  0.610                            
genoT:treatE2          0.335 -0.509 -0.658 -0.401                     
genoT:time             0.432 -0.656 -0.401 -0.658  0.610              
treatE2:time           0.451 -0.297 -0.886 -0.688  0.584  0.453       
genoT:treatE2:time    -0.297  0.451  0.584  0.453 -0.886 -0.688 -0.658

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-0.53883607 -0.11769447 -0.03528952  0.04858721  1.83383169 

Number of Observations: 600
Number of Groups: 
 animal        time %in% animal treat %in% time %in% animal
 30                         300                         600 

I am also concerned that the count data are not normally distributed. Should I use a GLM instead? If so, does that change the random effects? Please help. I have not seen any examples like this in any of the R books I have seen or on this blog.

Thanks in advance!!!

Best Answer

I will quickly address the general use of aov. When using aov in R, type I sum of squares are used. These are sequential, which means the order of variables will affect the results if the design is unbalanced (see here: http://goanna.cs.rmit.edu.au/~fscholer/anova.php). Type III sum of squares are sometimes preferred when there is an interaction and type II when there is not a significant interaction. This can be done in the car package with the function Anova (notice the capital A). This may be why your anova results did not make sense.

Now to address the question about mixed effect models. I would first recommend lme4, as I think the formula specification is easier to understand. For instance, the random effect would be + (1|animal/time/treatment). In regards to the degrees of freedom, it is not necessarily the case that your model is wrong. Douglas Bates, the author of lme4, has wrote extensively about the difficulties in calculating degrees of freedom in mixed models (https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html). This has also been discussed on this site (getting degrees of freedom from lmer). Because of this, the lme4 package does not provide p-values and, in order to calculate a p-value, extra steps are necessary such as sampling from the posterior. I am not sure if nlme is still being maintained, but it wouldn't hurt to email the authors.

In the event that the model is right, the tricky part will be interpreting the estimates (Interpreting the regression output from a mixed model when interactions between categorical variables are included). The reference category (i.e, the intercept) is going to be the first level of each factor. From what you have provided, this would be the first time point (I assume time is categorical because random effects are always factors), treatment = CON, and genotype = M. The p-value that is significant, for instance, is comparing time to this reference category. The question is whether this is a meaningful comparison? Using a package for Bayesian multilevel models, for instance brms or rstanarm (http://www.r-bloggers.com/r-users-will-now-inevitably-become-bayesians/), you could add posterior estimates together and use simple subtraction to obtain contrasts at each level of the factors.

This might not have been much help towards your initial question, but specification of random effects will generally change the estimate little unless there is great variation between levels of a random effect. Additionally random effects are not always straight forward (Minimum number of levels for a random effects factor?) or easy to define (What is the difference between fixed effect, random effect and mixed effect models?). If you still cannot get an answer to your question about random effects, you can try a sensitivity analysis. For instance, animal ID should be included as a random effect but the others are open to debate. You could check whether the estimates (eg, coefficients and confidence intervals) change drastically by only nesting some of the variables. If they do not, this would provide confidence in your model and you could mention the potential problem with the random effects in the discussion of your paper. For a more rigorous approach, you could use a likelihood ratio test comparing models that differ in regards to random effects (Likelihood ratio tests on linear mixed effect models). You can even use this test to determine whether time is significant. For instance, compare models that differ only in the inclusion of time.

Another option would be to use a gee, generalized estimating equation (r packages: gee & geepack), which might be appropriate here because the correlations between outcomes do not need to be correctly specified. The method is robust to "unknown" correlations. This is also ideal when samples are small (see here: http://epm.sagepub.com/content/76/1/64.short; https://en.wikipedia.org/wiki/Generalized_estimating_equation).

In regards to using different distributions, you could try glmer in the lme4 package with a negative binomial or Poisson distribution. The assumptions of a Poisson distribution are often violated (variance and mean must be close to equal). When there is over dispersion (variance is larger than the mean), the negative binomial distribution is preferred. Since you have 20 potential yes/no's, you should include the number of times possible as an offset which would model the counts as rates.

I hope this information can be of use for the manuscript!