Solved – Scaling predictors in mixed models

lme4-nlmemixed modelrrepeated measuresstandardization

I collected data over multiple locations and years on crop yield and want to regress yield as a function of rainfall and heat-stress which are in two different units. Suppose my dataframe has 5 columns: year, location, yield, rainfall and temperature. These are my steps:

 dat[,4:5] <- scale(dat[, 4:5], center = T, scale  T)
 model <- lmer(yield ~ rain + temp + (1|location) + (1|year), data = dat)

After I get this model and I want to use the model for prediction. Suppose I collect new data from different years or locations called dat1 which has 4 columns: year, location, rainfall and temperature.

My confusion is since the fitted model takes in standardised rainfall and temperature, how do I standardise these two variables in the new data dat1? Do I simple do:

dat1[,3:4] <- scale(dat1[,3:4], center = T, scale = T)
predict(model, newdata = dat1)

Or do I have to standarise the new data using mean and standard deviation of the original data dat?

Best Answer

Just because there is a warning message about predictors being on a different scale does not mean that you need to standardize them. Standardization can lead to difficulties with interpretation.

Generally, you only need to rescale the offending variable(s) by multiplying or dividing it by something appropriate, and since it is only a warning, you don't really need to do anything, although there are some situations where variables on vastly differing scales can cause numerical instability during model fitting.

You are correct that if you standardize the variables that are used as inputs for your model, you will have to standardize those in your test/prediction dataset too.

Model 1 (interaction, linear effects only)

# run model
library(lme4)
mdl <- lmer(yield ~ year * heat  + (1|location), data = my_data)

From your question, it seems like you're interested primarily in the overall fixed effects (not the estimates for individual locations, etc.), so I'll focus on that. If you want to get into visualizing the results for individual locations, I recommend this awesome blog post: https://tjmahr.github.io/plotting-partial-pooling-in-mixed-effects-models/

Here are the fixed effects as I estimated them (they're a bit different from your estimates, so your plot will look a bit different, too, but I think it's close enough to be useful):

> (fe <- summary(mdl)$coefficients[1:4,1])
  (Intercept)          year          heat     year:heat 
-74219.711477     38.293178   -117.339456      0.057416

Now let's plot it to take a look at that interaction. I'll choose to put heat on the x-axis and have three representative lines for low, medium, and high values of year, but you could switch and do it the other way if you prefer. I'll use those values (the full range of heat and the three example values of year) to generate predicted yield scores for each combination based on the fixed effects estimates from the model.

# select values for plotting
# full range of heat, and selected low, med and high values for year
plot_df <- expand.grid(heat=heat, year=c(min(year), mean(year), max(year))) %>% 
  mutate(yield = fe[1] + fe[2]*year + fe[3]*heat + fe[4]*year*heat)

Now I'll plot those predicted yield values as lines.

plot_interaction <- ggplot(plot_df, aes(y=yield, x = heat, color = as.factor(year))) + 
  geom_line(size=2) + 
  labs(color = "year") + 
  # tweak plot appearance
  scale_color_brewer(palette = "Dark2") + 
  theme(legend.position = "top") + 
  theme_classic()

plot_interaction

So what do we see here? First of all, the interaction is very small relative to the individual effects of year and heat --- the lines look almost parallel. But you know from the parameter estimates that it is significantly different from zero, so it's there even if it's small. heat has a negative effect when year = 0 (ideally, you will have centered both predictors before estimating the model, so this would be "at the mean of year"), and the positive interaction term indicates that as year increases, the effect of heat gets less negative, so it weakens. What you should look for in the plot is the slope to be a bit shallower for higher years than for low years.

Model 2 (linear and quadratic effects, with interactions)

The second model seems much more complicated, but it's actually just as easy to plot, almost! We'll do the same procedure as before, first getting the fixed effects from the model then generating some plotting data with the full range of heat, selected values of year, and predicted values for yield based on the fixed effect estimates.

First, since I don't have your real data, I'm re-generating another dataset that will (roughly) match the parameter estimates of the second model, so we can get a more or less accurate plot.

my_data <- base::expand.grid(year=year, heat=heat, location=location) %>% 
  mutate(yield = -97816.61 + 50.07*year + 20499.87*heat - 632.2*heat*heat - 10.22*year*heat +.31*year*heat*heat + rnorm(nrow(.), sd = 50))

Run the second model:

mdl <- lmer(yield ~ year*(heat + I(heat^2)) + (1|location), data = my_data)

Get the fixed effects:

fe <- summary(mdl)$coefficients[1:6,1]

> round(fe, 3)
   (Intercept)           year           heat      I(heat^2)      year:heat year:I(heat^2) 
    -95838.400         49.083      20410.924       -631.207        -10.176          0.310

Close enough. :) Now generate the plotting data.

# full range of heat, and selected low, med and high values for year
plot_df <- expand.grid(heat=heat, year=c(min(year), mean(year), max(year))) %>% 
  mutate(yield = fe[1] + fe[2]*year + fe[3]*heat + fe[4]*heat*heat + fe[5]*year*heat + fe[6]*year*heat*heat)

We can use exactly the same plotting code as before, just feed it the updated dataframe:

plot_interaction %+% plot_df

Now we can see that the general effect of heat on yield is negative, with an accelerating effect, so that there's a steeper negative slope at higher levels of heat. Moreover, we can see that the effect of heat depends on year (i.e. the interaction), such that heat drops off more sharply in earlier years than later years.

You should be able to generate plots for your own data using this code. I always find having a visual is extremely helpful when interpreting interactions, especially in models with more than a few coefficients (such as your model 2).

(Note that if the ranges for heat and year are pretty different in your data to what I came up with generating these data, then your plots might look quite different even though the parameter estimates are similar.)

A final note on plotting and interpreting interactions

My guess from your parameter estimates is that you didn't center year and heat before estimating the model, which is why I didn't either. But you probably should. Not only does it ease some model estimation issues like multicollinearity, it makes interpreting your interactions easier since 0 values are more meaningful. For example, the parameter estimates you're getting for heat are the predicted change in yield for each additional unit of heat when year = 0. Unless you've centered year (or you're studying heat and yields in antiquity), you're probably talking about values far outside of the range of your data. Your estimate for the effect of heat will be much more meaningful if you center first.

Solved – Dealing with heteroscedasticity in mixed models

The shape of the residuals seems to suggest that you have a lower/upper bound in your yield outcome variable. I.e., are some of the yields exactly zero? In this case, you may consider fitting a two-part mixed effects model for semi-continuous data. This model is, for example, available in the GLMMadaptive package. For an example check here.

To check the fit of the model in this case, it would be better to use the simulated scaled residuals provided by the DHARMa package. For an example that uses DHARMa with models fitted by GLMMadaptive you can check here.

Best Answer

Related Solutions

R – Interpreting Interactions in Linear Models vs Quadratic Models Using lme4-nlme

Model 1 (interaction, linear effects only)

Model 2 (linear and quadratic effects, with interactions)

A final note on plotting and interpreting interactions

Solved – Dealing with heteroscedasticity in mixed models

Related Question