You can do this somewhat by hand by taking advantage of the lapply
functionality in R and the list structure returned by the Amelia
multiple imputation package. Here's a quick example script.
library(Amelia)
library(lme4)
library(merTools)
library(plyr) # for collapsing estimates
Amelia is similar to mice, so you can just substitute your variables in from the mice call here -- this example is from a project I was working on.
a.out <- amelia(dat[sub1, varIndex], idvars = "SCH_ID",
noms = varIndex[!varIndex %in% c("SCH_ID", "math12")],
m = 10)
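For reference, a roughly equivalent workflow with mice might look like this (the dataset and variable names are placeholders mirroring the amelia call above); complete() pulls out each completed dataset, which you can collect into a list analogous to a.out$imputations:

```r
# hypothetical mice-based equivalent of the amelia() call above
library(mice)
imp <- mice(dat[sub1, varIndex], m = 10)
# gather the completed datasets into a list, like a.out$imputations
impList <- lapply(seq_len(imp$m), function(i) complete(imp, i))
```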
a.out is the imputation object; now we need to run the model on each imputed dataset. To do this, we use the lapply function in R to repeat a function over list elements. Here, lapply applies the model specification to each imputed dataset (d) in the list and returns the results as a list of fitted models.
mods <- lapply(a.out$imputations,
               function(d) lmer(log(wage) ~ gender + age + age_sqr +
                                  occupation + degree + private_sector + overtime +
                                  (1 + gender | faculty), data = d))
Now we create a data.frame from that list by simulating the values of the fixed and random effects, using the functions FEsim and REsim from the merTools package:
imputeFEs <- ldply(mods, FEsim, nsims = 1000)
imputeREs <- ldply(mods, REsim, nsims = 1000)
The data.frames above include separate estimates for each imputed dataset; now we need to collapse them across imputations by averaging within each term:
imputeREs <- ddply(imputeREs, .(X1, X2), summarize, mean = mean(mean),
median = mean(median), sd = mean(sd),
level = level[1])
imputeFEs <- ddply(imputeFEs, .(var), summarize, meanEff = mean(meanEff),
medEff = mean(medEff), sdEff = mean(sdEff))
Now we can also extract some statistics on the variance/covariance of the random effects across the imputed datasets. Here I have written two simple extractor functions to do this.
REsdExtract <- function(model){
out <- unlist(lapply(VarCorr(model), attr, "stddev"))
return(out)
}
REcorrExtract <- function(model){
out <- unlist(lapply(VarCorr(model), attr, "correlation"))
return(min(unique(out)))
}
And now we can apply them to the list of models and combine the results into a data.frame:
modStats <- cbind(ldply(mods, REsdExtract), ldply(mods, REcorrExtract))
Update
The functions below will get you much closer to the output provided by arm::display by operating on the list of lmer or glmer objects. Hopefully this will be incorporated into the merTools package in the near future:
# Functions to extract standard deviation of random effects from model
REsdExtract <- function(model){
out <- unlist(lapply(VarCorr(model), attr, "stddev"))
return(out)
}
#slope intercept correlation from model
REcorrExtract <- function(model){
out <- unlist(lapply(VarCorr(model), attr, "correlation"))
return(min(unique(out)))
}
modelRandEffStats <- function(modList){
SDs <- ldply(modList, REsdExtract)
corrs <- ldply(modList, REcorrExtract)
tmp <- cbind(SDs, corrs)
names(tmp) <- c("Imp", "Int", "Slope", "id", "Corr")
out <- data.frame(IntSD_mean = mean(tmp$Int),
SlopeSD_mean = mean(tmp$Slope),
Corr_mean = mean(tmp$Corr),
IntSD_sd = sd(tmp$Int),
SlopeSD_sd = sd(tmp$Slope),
Corr_sd = sd(tmp$Corr))
return(out)
}
modelFixedEff <- function(modList){
require(broom) # note: tidy() methods for merMod objects now live in broom.mixed
fixEst <- ldply(modList, tidy, effects = "fixed")
# Collapse
out <- ddply(fixEst, .(term), summarize,
estimate = mean(estimate),
std.error = mean(std.error))
out$statistic <- out$estimate / out$std.error
return(out)
}
# note: pfround() and fround() below come from the arm package
print.merModList <- function(modList, digits = 3){
len <- length(modList)
form <- modList[[1]]@call
print(form)
cat("\nFixed Effects:\n")
fedat <- modelFixedEff(modList)
dimnames(fedat)[[1]] <- fedat$term
pfround(fedat[-1, -1], digits)
cat("\nError Terms Random Effect Std. Devs\n")
cat("and covariances:\n")
cat("\n")
ngrps <- length(VarCorr(modList[[1]]))
errorList <- vector(mode = 'list', length = ngrps)
corrList <- vector(mode = 'list', length = ngrps)
for(i in 1:ngrps){
subList <- lapply(modList, function(x) VarCorr(x)[[i]])
subList <- apply(simplify2array(subList), 1:2, mean)
errorList[[i]] <- subList
subList <- lapply(modList, function(x) attr(VarCorr(x)[[i]], "correlation"))
subList <- min(unique(apply(simplify2array(subList), 1:2, mean)))
corrList[[i]] <- subList
}
errorList <- lapply(errorList, function(x) {
diag(x) <- sqrt(diag(x))
return(x)
})
lapply(errorList, pfround, digits)
cat("\nError Term Correlations:\n")
lapply(corrList, pfround, digits)
residError <- mean(unlist(lapply(modList, function(x) attr(VarCorr(x), "sc"))))
cat("\nResidual Error =", fround(residError,
digits), "\n")
cat("\n---Groups\n")
ngrps <- lapply(modList[[1]]@flist, function(x) length(levels(x)))
modn <- getME(modList[[1]], "devcomp")$dims["n"]
cat(sprintf("number of obs: %d, groups: ", modn))
cat(paste(paste(names(ngrps), ngrps, sep = ", "),
collapse = "; "))
cat("\n")
cat("\nModel Fit Stats")
mAIC <- mean(unlist(lapply(modList, AIC)))
cat(sprintf("\nAIC = %g", round(mAIC, 1)))
moDsigma.hat <- mean(unlist(lapply(modList, sigma)))
cat("\nOverdispersion parameter =", fround(moDsigma.hat,
digits), "\n")
}
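Putting it together, the helper functions above can then be applied directly to the list of fitted models from earlier (mods):

```r
# pooled fixed effects, pooled random-effect SDs/correlations,
# and an arm::display-style summary of the whole model list
modelFixedEff(mods)
modelRandEffStats(mods)
print.merModList(mods, digits = 3)
```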
Best Answer
There are two issues here: the handling of "outliers" and what may be a misunderstanding of how to use multiple imputation.
As someone who regularly does survival analysis in a biomedical setting, I completely agree with @DWin that you should not remove outliers (unless you know that there were errors in collecting those data points, in which case they should not have been entered into the analysis in any event; and if you are going to remove data points on the basis of errors in data collection, make sure you examine all the data points, not just the ones that seem to be outliers). The reasons that those observations appear to be outliers might be crucial to understanding the underlying problem. You may be doing yourself a major disservice if you just throw them away because a particular algorithm deemed them "outliers." It's even possible that their appearance as outliers might be removed in analyses of multiple imputations.
Second, when you say "How do you derive model fits (e.g. pseudo R-squared, log-likelihood, AIC) with pooled imputation data?" it seems that you are trying to do a single model fit on a pool of your multiple imputations. What you instead do is fit your model individually to each of your multiple imputed data sets. The results of the multiple models are then pooled; the differences in results among the different models/imputations help indicate the variation due to the imputation process. The rms package in R has facilities for handling survival analysis of multiply imputed data sets. As with any multiple imputation, make sure that you have confidence that the missing data are truly "missing at random" in the technical sense used for this approach: that the probability of an observation's being missing does not depend on its actual value. Were you to proceed by marking "outliers" as NA and then imputing their values, note that you would be directly violating that requirement for reliable multiple imputation. So you can't proceed in that way, even if you are still convinced that the data points are "outliers."
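As a sketch of that workflow (variable names here are placeholders, and aregImpute from Hmisc is just one of the imputation objects rms understands), fit.mult.impute fits the model separately to each completed data set and pools the coefficients and variances for you:

```r
library(rms)
library(Hmisc)
# impute missing values; time/status are the survival outcome,
# x1/x2 are placeholder covariate names
imp <- aregImpute(~ time + status + x1 + x2, data = dat, n.impute = 10)
# fit a Cox model (cph) to each completed data set and pool the results
fit <- fit.mult.impute(Surv(time, status) ~ x1 + x2, cph, imp, data = dat)
fit  # pooled coefficients with imputation-adjusted standard errors
```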