Solved – Calculating Estimated Marginal Means from univariate data

lsmeansr

I am trying to calculate the estimated marginal means (aka least squared means) in R in order to do statistical analysis for a univariate dataset and am struggling as all the examples are from multivariate datasets.

My design is count data (wallaby scats from fixed quadrats) with repeated measures (samples taken once a year over three years). I am interested in the mean changes of scat counts over the three years. A reduced sample of my data looke like:

My mixed effects model looks like:

scatcount ~ year + (1|plot)

the random effect of plot is included to account for the repeated measures.

I was advised to calculate the estimated marginal means and am using the "emmeans" package in R. However I am struggling to get the code and process right for this. All the examples I can find are comparing two or more fixed and/or random variables so I'm struggling to apply it to my data. I am also unsure how to include the random effect of plot in the EMM calculation, or if I even need to? Any suggestions on how to do this in R would be much appreciate!

Best Answer

You need to do something like this:

library(lme4)
model <- lmer(scatcount ~ year + (1|plot), data = scatdata)

library(emmeans)
emmeans(model, "year")  # display the EMMs
pairs(.Last.value)      # pairwise comparisons

I'll consider adding a simple one-factor example somewhere in the documentation.

Note: The model is what accounts for the random effect. emmeans just summarizes results from a model; so if the model accounts properly for the random effect(s), you don't need to do anything extra in emmeans.

PS -- For that particular response variable, I'm guessing you will have pretty heterogeneous errors (evidenced, e.g., by a horn-shaped scatterplot of residuals versus fitted values). You probably should consider either a transformed response (e.g., sqrt(scatcount) ~ ...) or a generalized linear mixed model for a Poisson response (using, e.g., lme4::glmer()). Such models are also supported by the emmeans package but you may want to do additional things to obtain results back on the scatcount scale; see the vignette on transformations and link functions.

Related Solutions

Solved – Fitting and interpreting random effects in repeated measures and unbalanced ecological data set

It is worth noting that there are 2 types of random effects when using mixed effects models: random intercepts, and random coefficients/slopes. Random intercepts are used when there is clustering/grouping/nesting of observations. Whatever the thing is that the observations are clustered in (in your case Plot), would be used as a grouping variable (and is specified by placing it on the right side of the | symbol in the formula). Random intercepts will then be estimated for it. You can have multiple levels of clustering/grouping/nesting (if plots are nested within something else for example which doesn't seem to be the case with your setup), and you can have nesting within more than 1 factor at the same time, without one being nested in the other (crossed random effects).

Considering the model formula you wrote above:

model <- lmer(response ~ covariates + (1|Plot) + (1|Year) )

Here you have specified crossed random effects, observations are clustered within Plot and also within Year. It seems reasonable and correct to use Plot as a clustering/nesting/grouping variable, but not Year because (presumably) all your plots were sampled in every year, so there is no clustering/nesting/grouping within Year.

Considering the alternative model you suggested:

model <- lmer(response ~ covariates + (Year|Plot) )

Here, you still have observations clustered within Plot, but year is now specified as a random coefficient, giving rise to random slopes. This means that for each level of Plot the effect of Year is allowed to vary and the model will estimate now much it varies by estimating it's variance. Since you have 3 years, then depending on how your code the variable, you will have different interpretations - if it is a continuous variable, then you will get one estimate for the variance of each year. If you code it as a numeric (say 1,2,3) then you will get a single variance. It would usually make sense to include Year as a fixed effect too, otherwise you are imposing the restriction that the average effect of year is zero (since the random effects are assumed to come from a normal distribution with zero mean). Either model (coding Year as a factor or numeric) would make sense, the benefit of it being a factor is that it effectively allows for nonlinear change, while the drawback is that it has more things to estimate (which sometimes results in numerical/computational problems) and thus more for you to interpret. If there is little or no nonlinearity and the variances for each year are similar it would make more sense to use the numeric coding.

Finally, it is probably worth considering whether you want to include Year as a random effect in the first place. If you are expecting the effect of Year to differ a lot between different plots then it is a good idea, but usually a better starting point is just to use Year as a fixed effect, and assess how well that model fits the data before adding it as a random effect.

Solved – Difference between (factor|group) and (1|factor:group) specifications in lme4

At first sight, there are two main differences:

mod1 is a random intercept and random slope model while mod2 is a random intercept model (without random slope): (year|cluster) implicitly includes the intercept and thus expands to (1 + year|cluster). In this case you assume a different baseline for each cluster and allow the clusters to vary with respect to the year effect.
mod2 includes an interaction, mod1 doesn't: As mod2 only includes a random intercept for the cluster*year interaction you get the estimates for every unique combination of the factor levels (i.e. AA:year2, AA:year3, ...)

Note that mod1 estimates $k$ variances and $k(k-1)/2$ correlations for the $k$ random effects per cluster. If you constrain the cluster-related variances to be equal and all the correlations also to be equal (this is called compound symmetry), you get a model with many fewer (only two) variance/covariance parameters:

mod3 <- lmer(y ~ x + year + (1|cluster) + (1|year:cluster))

(note the similarity/difference compared to mod2) which can be useful when there is not enough information in the data to estimate mod1. For more on this see this excellent post by Reinhold Kliegl and also my follow-up question: Equivalence of (0 + factor|group) and (1|group) + (1|group:factor) random effect specifications in case of compound symmetry.

So mod3 can be seen as a restricted version of mod1. Your mod2 is restricted even further, but this one-parameter parametrization is arguably less realistic.

Best Answer

Related Solutions

Solved – Fitting and interpreting random effects in repeated measures and unbalanced ecological data set

Solved – Difference between (factor|group) and (1|factor:group) specifications in lme4

Related Question