Solved – How to use the predict() function to average over polynomial terms in a linear model so that it does not over- or under-estimate the data

Tags: mixed-model, polynomial, r, regression

I have been having trouble with the predict() function under- or over-estimating the predictions from an lmer model that includes polynomial terms. Update: the same problem occurs with lm models, i.e. it has nothing to do with mixed models. The example code below still uses lmer, but the random effects can be ignored as far as this question is concerned.

I have scaled data (posted here) that looks like this:

Terr     Date      Year        Age
T.092    123    0.548425   -0.86392
T.104    102    1.2072     -0.48185
T.104    105    1.075445   -0.86392
T.104    112    0.94369    -1.24599
T.040    116   -0.2421      2.192652
T.040    114   -0.37386     1.810581
T.040    119   -0.50561     1.428509
T.040    128    0.15316    -0.09978
T.040    113    0.021405   -0.48185

I’m trying to determine how Year affects lay date after controlling for Age, with Terr (territory) as a random effect. I usually include polynomials and do model averaging, but whether I use a single model or model averaging, the predict function gives predictions that are a bit lower or higher than they should be. I realize that the model below would not be a good model for these data; I’m just trying to provide a simplified example.

Below is my code:

library(lme4)

# Model with a quadratic Age term; predict Date for each Year,
# holding Age at its mean
m1 <- lmer(Date ~ (1 | Terr) + Year + Age + I(Age^2), data = data)
new.dat <- data.frame(Year = c(-2, -1, 0, 1),
                      Age  = mean(data$Age, na.rm = TRUE))
Predictions <- predict(m1, newdata = new.dat, re.form = NA)
pred.l <- cbind(new.dat, Predictions)
pred.l

      Year          Age Predictions
    1   -2 2.265676e-16    124.4439
    2   -1 2.265676e-16    123.2124
    3    0 2.265676e-16    121.9810
    4    1 2.265676e-16    120.7496

When plotted with the means, the graph looks like this:

[graph1: the predict() values plotted with the yearly mean Dates; the predicted line sits below the data]

When I use the effects package instead, I get a much better fit:

library(effects)

# effect() computes the marginal effect of Year for the same model
ef.1c   <- effect("Year", m1, xlevels = list(Year = -2:1))
pred.lc <- data.frame(ef.1c)
pred.lc

      Year      fit        se    lower    upper
    1   -2 126.0226 0.6186425 124.8089 127.2363
    2   -1 124.7911 0.4291211 123.9493 125.6330
    3    0 123.5597 0.3298340 122.9126 124.2068
    4    1 122.3283 0.3957970 121.5518 123.1048

[graph2: the effects fit plotted with the yearly mean Dates; the line matches the data well]

After much trial and error, I have discovered that the problem lies with the Age polynomial: when the Age polynomial is not included, the predict() values and the effects fits are equal, and both fit the data well. Below is the same model but with Age as a linear term only.

# Same model, but with Age entering only linearly
m2 <- lmer(Date ~ (1 | Terr) + Year + Age, data = data)
new.dat <- data.frame(Year = c(-2, -1, 0, 1),
                      Age  = mean(data$Age, na.rm = TRUE))
Predictionsd <- predict(m2, newdata = new.dat, re.form = NA)
pred.ld <- cbind(new.dat, Predictionsd)
pred.ld

      Year          Age Predictionsd
    1   -2 2.265676e-16     125.9551
    2   -1 2.265676e-16     124.7653
    3    0 2.265676e-16     123.5755
    4    1 2.265676e-16     122.3857

library(effects)
ef.1e   <- effect("Year", m2, xlevels = list(Year = -2:1))
pred.le <- data.frame(ef.1e)
pred.le

      Year      fit        se    lower    upper
    1   -2 125.9551 0.6401008 124.6993 127.2109
    2   -1 124.7653 0.4436129 123.8950 125.6356
    3    0 123.5755 0.3406741 122.9072 124.2439
    4    1 122.3857 0.4093021 121.5827 123.1887

I do many similar analyses, and this issue of the predictions being slightly lower (or higher) than they should be often happens when Age is included as a polynomial. When I include a polynomial for Year instead, there is no problem and the predict() values and the effects fits are equal, so I know the problem is not with polynomials in general.

# Polynomial in Year instead of Age: here predict() and effects agree
m3 <- lmer(Date ~ (1 | Terr) + Year + I(Year^2) + Age, data = data)
new.dat <- data.frame(Year = c(-2, -1, 0, 1),
                      Age  = mean(data$Age, na.rm = TRUE))
Predictionsf <- predict(m3, newdata = new.dat, re.form = NA)
pred.lf <- cbind(new.dat, Predictionsf)
pred.lf

      Year          Age Predictionsf
    1   -2 2.265676e-16     125.6103
    2   -1 2.265676e-16     124.8494
    3    0 2.265676e-16     123.7483
    4    1 2.265676e-16     122.3070

library(effects)
ef.1g   <- effect("Year", m3, xlevels = list(Year = -2:1))
pred.lg <- data.frame(ef.1g)
pred.lg

      Year      fit        se    lower    upper
    1   -2 125.6103 0.8206625 124.0003 127.2203
    2   -1 124.8494 0.4615719 123.9438 125.7549
    3    0 123.7483 0.4275858 122.9094 124.5871
    4    1 122.3070 0.4262110 121.4708 123.1431

I've looked for answers (e.g., here) but haven't found anything directly helpful. Does anyone have any insight?

Best Answer

Your problem has nothing to do with mixed models.

Your Age variable is centered to have mean zero. If your model contains only a linear Age term, then to get a prediction for each Year you can simply supply Age = 0, which is effectively what you are doing in your new.dat data frame (mean(data$Age) is numerically zero). And it works fine.
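You can see the crux numerically with a quick check on the data frame from the question:

# Age is standardized: its mean is numerically zero (so Age = 0 is the
# right stand-in for a linear term), but the mean of its square is ~1
# (it equals the variance when the mean is zero), so plugging in Age = 0
# zeroes a quadratic term whose average contribution is not zero
mean(data$Age, na.rm = TRUE)    # ~2.27e-16, i.e. zero
mean(data$Age^2, na.rm = TRUE)  # ~1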

But if you also have a quadratic term, then the combined contribution of Age + Age^2 will not be zero on average, and so your predicted curve will not match the data. Instead, you need to supply some value $a$ such that $$\beta_1 a + \beta_2 a^2 = \Big\langle \beta_1 \text{Age} + \beta_2 \text{Age}^2 \Big\rangle = \Big\langle \beta_2 \text{Age}^2 \Big\rangle = \beta_2 \cdot 1 = \beta_2,$$ where $\beta_1$ and $\beta_2$ are the model coefficients for the linear and quadratic Age terms respectively (and I used the fact that Age is standardized, i.e. has zero mean and unit variance). This is a quadratic equation in $a$, with roots $$a = \frac{-\beta_1 \pm \sqrt{\beta_1^2 + 4\beta_2^2}}{2\beta_2};$$ either root can be put into the data frame that is passed to predict().
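A minimal sketch of this fix, reusing the m1 fit and data from your question (the coefficient names are the ones lme4 assigns to these terms):

b  <- fixef(m1)          # fixed-effect coefficients of m1
b1 <- b[["Age"]]         # linear Age coefficient
b2 <- b[["I(Age^2)"]]    # quadratic Age coefficient

# Solve b2*a^2 + b1*a - b2 = 0 for a; either root works
a <- (-b1 + sqrt(b1^2 + 4 * b2^2)) / (2 * b2)

new.dat <- data.frame(Year = c(-2, -1, 0, 1), Age = a)
predict(m1, newdata = new.dat, re.form = NA)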

As you wrote in the comments, this does the trick.

This approach extends in an obvious way to higher-order polynomials, although the equations get more complicated. It also explains why your Year polynomial caused no trouble: there you supplied actual Year values in new.dat, so Year and Year^2 were evaluated consistently with each other. I am not sure what strategy the effects package uses to handle this.
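One model-agnostic alternative (not necessarily what effects does internally) is to average the predictions over the observed Age values at each Year of interest; because the model is linear in its coefficients, this averages the Age and Age^2 contributions correctly. A sketch, again reusing m1 and data from your question:

# Marginal predictions: for each Year, predict at every observed Age
# and average, rather than predicting at a single "average" Age
obs.age <- data$Age[!is.na(data$Age)]
sapply(c(-2, -1, 0, 1), function(yr) {
  nd <- data.frame(Year = yr, Age = obs.age)
  mean(predict(m1, newdata = nd, re.form = NA))
})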