Solved – Obtaining adjusted (predicted) proportions with lme4 – using the glmer-function

lme4-nlmelogisticmixed modelr

I aim to estimate the annual proportion of patients (% of patients) that are smokers in a population whose age and sex must be taken into account. In other words, I want to calculate the adjusted prevalence (%) of smoking each year. I have repeated measurements on the same individuals and want to model the individual as a random effect, which is why I use the lme4 package, more precisely the glmer function. The variable of main interest is "year" (period 1996 to 2014), which I need to model as a fixed effect.

Aim: Obtain adjusted proportions (%) of smokers each year.

Suppose the data set is named "df" and the year variable is converted to a factor.

I tried this code (generated with a slightly different data set than the attached one) to fit the model:

> smoke <- glmer(smoker ~ biomarker + year + sex + age + (1 | id), data
> = df, family = binomial, nAGQ = 1)

Fixed effects:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -6.201632   0.231582 -26.779  < 2e-16 ***
biomarker        -0.015364   0.008299  -1.851  0.06413 .  
yuar1997           0.648292   0.212400   3.052  0.00227 ** 
yuar1998          -0.586996   0.227217  -2.583  0.00978 ** 
yuar1999          -1.194309   0.216907  -5.506 3.67e-08 ***
yuar2000          -0.999889   0.217536  -4.596 4.30e-06 ***
yuar2001          -0.884453   0.203351  -4.349 1.37e-05 ***
yuar2002          -0.777464   0.199151  -3.904 9.47e-05 ***
yuar2003          -0.961869   0.194723  -4.940 7.83e-07 ***
yuar2004          -1.755470   0.197157  -8.904  < 2e-16 ***
yuar2005          -1.207833   0.189753  -6.365 1.95e-10 ***
yuar2006          -1.072532   0.187504  -5.720 1.07e-08 ***
yuar2007          -1.494477   0.189467  -7.888 3.08e-15 ***
yuar2008          -2.441916   0.191069 -12.780  < 2e-16 ***
yuar2009          -1.881562   0.187321 -10.045  < 2e-16 ***
yuar2010          -2.254924   0.187254 -12.042  < 2e-16 ***
yuar2011          -1.634935   0.184929  -8.841  < 2e-16 ***
yuar2012          -2.405588   0.187349 -12.840  < 2e-16 ***
yuar2013          -2.119775   0.186729 -11.352  < 2e-16 ***
yuar2014          -2.241768   0.210259 -10.662  < 2e-16 ***
sex              -0.071377   0.115975  -0.615  0.53826    
age              -0.012897   0.008011  -1.610  0.10742 

Using the predict function to obtain probability of being a smoker in 2005:

predict(smoke, data.frame(age=mean(df$age), year="2005", sex=mean(df$sex), biomarker=mean(df$biomarker, na.rm=T)), type="response", re.form = NA)

I obtain much too low probabilities of being a smoker a particular year:

0.0002233488

The same is true when using the lsmeans and effects package. Figures should be around 5–15% smokers.

In short, in the data set I'm aiming to obtain the proportions of smokers during different years, adjusted for differences in age, sex and the biomarker while accounting of repeated measurements.

I'd be extremely grateful for a solution to these problems.

Best Answer

predict() in lme4 does not work well unless the grouping factor specification is "realistic". If we use samples from the observed data, we get reasonable predictions. I think this is a bug in predict.merMod()

This is lme4 1.1-7

my.fit <- glmer(smoker ~ biomarker + year + sex + age + (1|id), data = df, family = binomial(link='logit'), nAGQ = 0)
## predict with the observed data
predictions <- predict(my.fit, type = "response")
mean(predictions, na.rm = TRUE)
[1] 0.1144976
## predict for year 1996 to year 2014
new.df <- df[sample(nrow(df), replace = TRUE), ]
sapply(1996:2014, function(x) {
  new.df$year <- x
  predictions <- predict(my.fit, newdata = new.df, type = "response", allow.new.levels=TRUE)
  mean(predictions, na.rm = TRUE)
})
[1] 0.15785350 0.15374013 0.14974742 0.14586612 0.14208791 0.13840528 0.13481145 0.13130033
[9] 0.12786645 0.12450488 0.12121121 0.11798144 0.11481198 0.11169959 0.10864132 0.10563448
[17] 0.10267662 0.09976548 0.09689897

Compare that with what you get if you set id to a new value, not present in df when my.fit was evaluated.

new.df$id <- 1
    sapply(1996:2014, function(x) {
      new.df$year <- x
  predictions <- predict(my.fit, newdata = new.df, type = "response", allow.new.levels=TRUE)
  mean(predictions, na.rm = TRUE)
})
[1] 0.07292960 0.06660550 0.06079003 0.05544891 0.05054904 0.04605868 0.04194759 0.03818706
[9] 0.03475000 0.03161093 0.02874596 0.02613281 0.02375069 0.02158030 0.01960378 0.01780458
[17] 0.01616745 0.01467832 0.01332425

If you want confidence intervals around the predicted means, I can recommend the boot package

my.bootstrap.predictions.f <- function(data, indices){
  return(mean(predict(my.fit, newdata = data[indices, ], type = "response", allow.new.levels=TRUE), na.rm=TRUE))
}
## predict for year 1996 to year 2014
new.df <- df[sample(nrow(df), replace = TRUE), ]
time.period <- 1996:2014
my.results <- matrix(nrow=length(time.period), ncol = 4)
for(x in 1:length(time.period)){
  my.results[x, 1] <- time.period[x]
  new.df$year <- time.period[x]
  ## bootstrap using a realistic number of samples per year, say 20000
  my.boot.obj <- boot(data = new.df[sample(nrow(new.df), 20000, replace = TRUE), ], statistic = my.bootstrap.predictions.f, R = 100)
  my.results[x, 2] <- my.boot.obj[[1]]
  my.results[x, 3:4] <- quantile(my.boot.obj[[2]], c(0.025, 0.975))
}
colnames(my.results) <- c("Year", "mean proportion", "lower.ci", "upper.ci")

Using R = 2 in the call to boot, I got the following results:

my.results
      Year mean proportion   lower.ci   upper.ci
 [1,] 1996      0.15546203 0.15074711 0.15913858
 [2,] 1997      0.15259172 0.15187967 0.15367255
 [3,] 1998      0.14850053 0.14672908 0.14758575
 [4,] 1999      0.14304792 0.14076788 0.14285565
 [5,] 2000      0.14340505 0.14308830 0.14354602
 [6,] 2001      0.13575446 0.13580401 0.13906530
 [7,] 2002      0.13378345 0.13092585 0.13734706
 [8,] 2003      0.13135301 0.13367143 0.13379709
 [9,] 2004      0.12884709 0.12817521 0.12889604
[10,] 2005      0.12407854 0.12164363 0.12338582
[11,] 2006      0.11703560 0.11692974 0.12129822
[12,] 2007      0.11701868 0.11660557 0.12143424
[13,] 2008      0.11427061 0.11247954 0.11578390
[14,] 2009      0.10743174 0.10681350 0.11119430
[15,] 2010      0.10769067 0.10547794 0.10833410
[16,] 2011      0.10677037 0.10734999 0.10867513
[17,] 2012      0.10064624 0.09991072 0.10205153
[18,] 2013      0.09735750 0.09704462 0.10003968
[19,] 2014      0.09448228 0.09331708 0.09346973