predict() in lme4 does not work well unless the grouping factor specification is "realistic". If we use samples from the observed data, we get reasonable predictions. I think this is a bug in predict.merMod()
This is lme4 1.1-7
my.fit <- glmer(smoker ~ biomarker + year + sex + age + (1|id), data = df, family = binomial(link='logit'), nAGQ = 0)
## predict with the observed data
predictions <- predict(my.fit, type = "response")
mean(predictions, na.rm = TRUE)
[1] 0.1144976
## predict for year 1996 to year 2014
new.df <- df[sample(nrow(df), replace = TRUE), ]
sapply(1996:2014, function(x) {
new.df$year <- x
predictions <- predict(my.fit, newdata = new.df, type = "response", allow.new.levels=TRUE)
mean(predictions, na.rm = TRUE)
})
[1] 0.15785350 0.15374013 0.14974742 0.14586612 0.14208791 0.13840528 0.13481145 0.13130033
[9] 0.12786645 0.12450488 0.12121121 0.11798144 0.11481198 0.11169959 0.10864132 0.10563448
[17] 0.10267662 0.09976548 0.09689897
Compare that with what you get if you set id to a new value, not present in df when my.fit was evaluated.
new.df$id <- 1
sapply(1996:2014, function(x) {
new.df$year <- x
predictions <- predict(my.fit, newdata = new.df, type = "response", allow.new.levels=TRUE)
mean(predictions, na.rm = TRUE)
})
[1] 0.07292960 0.06660550 0.06079003 0.05544891 0.05054904 0.04605868 0.04194759 0.03818706
[9] 0.03475000 0.03161093 0.02874596 0.02613281 0.02375069 0.02158030 0.01960378 0.01780458
[17] 0.01616745 0.01467832 0.01332425
If you want confidence intervals around the predicted means, I can recommend the boot
package
my.bootstrap.predictions.f <- function(data, indices){
return(mean(predict(my.fit, newdata = data[indices, ], type = "response", allow.new.levels=TRUE), na.rm=TRUE))
}
## predict for year 1996 to year 2014
new.df <- df[sample(nrow(df), replace = TRUE), ]
time.period <- 1996:2014
my.results <- matrix(nrow=length(time.period), ncol = 4)
for(x in 1:length(time.period)){
my.results[x, 1] <- time.period[x]
new.df$year <- time.period[x]
## bootstrap using a realistic number of samples per year, say 20000
my.boot.obj <- boot(data = new.df[sample(nrow(new.df), 20000, replace = TRUE), ], statistic = my.bootstrap.predictions.f, R = 100)
my.results[x, 2] <- my.boot.obj[[1]]
my.results[x, 3:4] <- quantile(my.boot.obj[[2]], c(0.025, 0.975))
}
colnames(my.results) <- c("Year", "mean proportion", "lower.ci", "upper.ci")
Using R = 2 in the call to boot
, I got the following results:
my.results
Year mean proportion lower.ci upper.ci
[1,] 1996 0.15546203 0.15074711 0.15913858
[2,] 1997 0.15259172 0.15187967 0.15367255
[3,] 1998 0.14850053 0.14672908 0.14758575
[4,] 1999 0.14304792 0.14076788 0.14285565
[5,] 2000 0.14340505 0.14308830 0.14354602
[6,] 2001 0.13575446 0.13580401 0.13906530
[7,] 2002 0.13378345 0.13092585 0.13734706
[8,] 2003 0.13135301 0.13367143 0.13379709
[9,] 2004 0.12884709 0.12817521 0.12889604
[10,] 2005 0.12407854 0.12164363 0.12338582
[11,] 2006 0.11703560 0.11692974 0.12129822
[12,] 2007 0.11701868 0.11660557 0.12143424
[13,] 2008 0.11427061 0.11247954 0.11578390
[14,] 2009 0.10743174 0.10681350 0.11119430
[15,] 2010 0.10769067 0.10547794 0.10833410
[16,] 2011 0.10677037 0.10734999 0.10867513
[17,] 2012 0.10064624 0.09991072 0.10205153
[18,] 2013 0.09735750 0.09704462 0.10003968
[19,] 2014 0.09448228 0.09331708 0.09346973
Best Answer
The lsmeans package does produce the correct results with model objects from a number of packages, including lme4. If you have a fairly recent update of lsmeans installed, you can do
? models
and see information on what model objects are supported, and details of any special provisions.