Solved – Calculating pooled p-values manually

mice multiple-imputation r

For reasons I won't go into, I need to calculate parameter estimates from several imputed datasets. Based on this CV post about Rubin's rules, I have worked out how to manually calculate both the pooled coefficient and its standard error. However, the method for the p-value eludes me. If it were up to me I would make do with the pooled coefficient and SE, but I know my boss will want the p-value.

Here is the toy analysis.

First the imputation.

require(mice)
set.seed(123)
nhimp <- mice(nhanes)
v <- sapply(1:5, function(i) {
  fit <- lm(chl ~ bmi, data=complete(nhimp, i))
  print(c('coef'=coef(fit)[2], 'var'=vcov(fit)[2, 2], 'p'=summary(fit)$coefficients["bmi",4]))
})

Create a matrix of the estimates extracted from the model fitted to each of the five completed datasets. The first column holds the coefficients, the second the variances, and the third the p-values.

(mat <- t(v))

#      coef.bmi      var          p
# [1,] 2.195180 5.467482 0.35758491
# [2,] 4.231113 3.603410 0.03587456
# [3,] 3.470647 5.475586 0.15159999
# [4,] 2.937763 4.776171 0.19198054
# [5,] 2.208305 2.972943 0.21304488

Now for the pooled estimates. The pooled estimate of the regression coefficient is easy: it's just the mean of the coefficients (first column).

(pooledMean <- mean(mat[,1]))

# [1] 3.008602

Calculating the pooled estimate of the standard error is a bit more tricky, but still relatively simple. Total variance is the sum of within-variance and between-variance*degrees-of-freedom-correction.
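In symbols, the pooling steps that follow can be written as below. The notation is an assumption on my part, borrowed from the usual presentation of Rubin's rules: \hat{Q}_i is the coefficient from imputed dataset i, U_i its estimated variance, and m the number of imputations. Note that (m+1)/m is the same thing as 1 + 1/m.

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}U_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i-\bar{Q}\right)^2, \qquad
T = \bar{U} + \left(1+\frac{1}{m}\right)B, \qquad
\mathrm{SE}_{\text{pooled}} = \sqrt{T}
```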

Within variance is the average of the imputation-specific variances of the point estimate.

(withinVar <- mean(mat[,2])) # mean of variances
# [1] 4.459118

Between variance is the variance of the coefficients (variance of the first column, or the sd of the first column squared).

(betweenVar <- sd(mat[,1])^2) # variance of coefficients
# [1] 0.7537916

The degrees of freedom correction is (m+1)/m, where m is the number of imputations.

(dfCorrection <- (nrow(mat)+1)/(nrow(mat))) # dfCorrection
# [1] 1.2

Now we can calculate total variance

(totVar <- withinVar + betweenVar*dfCorrection) 
# [1] 5.363668

The pooled standard error is just the square root of the total variance

(pooledSE <- sqrt(totVar)) # standard error
# [1] 2.315959
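The steps above can be wrapped into one small function. This is only a sketch: `pool_rubin` is a hypothetical helper name (it is not part of mice), and the two input vectors are typed in from the matrix printed earlier so that the snippet runs in base R without the imputation object.

```r
# Hypothetical helper (not a mice function): pool one coefficient
# across m imputations using Rubin's rules.
pool_rubin <- function(est, est_var) {
  m       <- length(est)            # number of imputations
  within  <- mean(est_var)          # average within-imputation variance
  between <- var(est)               # variance of the m point estimates
  total   <- within + (1 + 1 / m) * between
  c(estimate = mean(est), within = within, between = between,
    total = total, se = sqrt(total))
}

# Coefficients and variances copied from the matrix above
est     <- c(2.195180, 4.231113, 3.470647, 2.937763, 2.208305)
est_var <- c(5.467482, 3.603410, 5.475586, 4.776171, 2.972943)

res <- pool_rubin(est, est_var)
print(round(res, 5))
```

The `estimate` and `se` entries should agree with pooledMean and pooledSE up to the rounding of the typed-in inputs.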

Now for the part I don't know: how to get the pooled estimate of the p-value.

(pooledP <- mean(mat[,3])) #??????
# [1] 0.190017

Put them all together

(pooledEstimates <- round(c(pooledMean, pooledSE, pooledP),5))
# [1] 3.00860 2.31596 0.19002

These should be exactly the same as the pooled values for these parameters returned by mice

fit <- with(data = nhimp, expr = lm(chl ~ bmi))
summary(pool(fit))
#          term   estimate std.error statistic       df   p.value
# 1 (Intercept) 111.958092 61.373512  1.824209 15.93028 0.0869345
# 2         bmi   3.008602  2.315959  1.299074 15.68225 0.2126945

The manually calculated pooled coefficient and SE are the same as those returned by the pool() function, but the p-value is not. Can anyone explain simply how mice calculates the pooled p-value? This post explains how to do it with software, but I need to calculate it manually.

Best Answer

This is for anyone who is interested, after reading pp. 37-43 in Flexible Imputation of Missing Data by Stef van Buuren. Call the adjusted degrees of freedom nu; it can be computed as follows.

  m <- nrow(mat)
  lambda <- (betweenVar + (betweenVar/m))/totVar
  n <- nrow(nhimp$data)
  k <- length(coef(lm(chl~bmi,data = complete(nhimp,1))))
  nu_old <- (m-1)/lambda^2  
  nu_com <- n-k
  nu_obs <- (nu_com+1)/(nu_com+3)*nu_com*(1-lambda)
  (nu_BR <- (nu_old*nu_obs)/(nu_old+nu_obs))
  # [1] 15.68225
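Written out as formulas, the quantities in the code above are as follows. This is my transcription of the code rather than a quotation from the book; B is the between-imputation variance, T the total variance, n the number of rows in the data and k the number of fitted coefficients.

```latex
\lambda = \frac{\left(1+\frac{1}{m}\right)B}{T}, \qquad
\nu_{\text{old}} = \frac{m-1}{\lambda^{2}}, \qquad
\nu_{\text{com}} = n-k, \qquad
\nu_{\text{obs}} = \frac{\nu_{\text{com}}+1}{\nu_{\text{com}}+3}\,\nu_{\text{com}}\,(1-\lambda), \qquad
\nu_{\text{BR}} = \frac{\nu_{\text{old}}\,\nu_{\text{obs}}}{\nu_{\text{old}}+\nu_{\text{obs}}}
```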

nu_BR, the Barnard-Rubin adjusted degrees of freedom, matches the degrees of freedom for the bmi variable given by the summary(pool(fit)) call above: 15.68225. So we can pass this value to the df argument of the pt() function to obtain the two-tailed p-value for the imputed model.

# abs() keeps the two-tailed p-value correct when the coefficient is negative
pt(q = abs(pooledMean / pooledSE), df = nu_BR, lower.tail = FALSE) * 2
# [1] 0.2126945

And this manually calculated p-value now matches the p-value from the mice function output.
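For convenience, the whole calculation can be collected into one base R function. This is a sketch under stated assumptions: `pool_pvalue` is a hypothetical name, the inputs are typed in from the matrix in the question, n = 25 is the number of rows in nhanes, and k = 2 counts the intercept and the bmi slope.

```r
# Hypothetical helper: pooled t statistic, Barnard-Rubin degrees of
# freedom and two-sided p-value for a single coefficient.
pool_pvalue <- function(est, est_var, n, k) {
  m       <- length(est)                       # number of imputations
  between <- var(est)                          # between-imputation variance
  total   <- mean(est_var) + (1 + 1 / m) * between
  lambda  <- (1 + 1 / m) * between / total     # share of variance due to missingness
  nu_old  <- (m - 1) / lambda^2
  nu_com  <- n - k                             # complete-data degrees of freedom
  nu_obs  <- (nu_com + 1) / (nu_com + 3) * nu_com * (1 - lambda)
  nu_br   <- nu_old * nu_obs / (nu_old + nu_obs)
  t_stat  <- mean(est) / sqrt(total)
  p       <- 2 * pt(abs(t_stat), df = nu_br, lower.tail = FALSE)
  c(statistic = t_stat, df = nu_br, p.value = p)
}

est     <- c(2.195180, 4.231113, 3.470647, 2.937763, 2.208305)
est_var <- c(5.467482, 3.603410, 5.475586, 4.776171, 2.972943)

out <- pool_pvalue(est, est_var, n = 25, k = 2)
print(out)
```

One caveat: if all m estimates are identical, the between variance is zero, lambda is zero, and the formula for nu_BR returns NaN, so that case would need separate handling.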

Related Solutions

Solved – Pooling imputed, still not analysed datasets in MICE

A major point of multiple imputation is to do separate analyses on each of the imputed data sets, so that you can get both pooled estimates of things like regression coefficients and an estimate of the errors in the coefficients. Averaging the imputed data sets first is not the correct use of this approach. And don't limit yourself to so few imputations; with modern computers there's no reason not to do 100 or more. See http://www.stefvanbuuren.nl/mi/MI.html, from the person who developed the mice package, for further information.

Solved – lmer with multiply imputed data

You can do this somewhat by hand by taking advantage of the lapply functionality in R and the list structure returned by the Amelia multiple imputation package. Here's a quick example script.

library(Amelia)
library(lme4)
library(merTools)
library(plyr) # for collapsing estimates

Amelia is similar to mice so you can just substitute your variables in from the mice call here -- this example is from a project I was working on.

 a.out <- amelia(dat[sub1, varIndex], idvars = "SCH_ID", 
            noms = varIndex[!varIndex %in% c("SCH_ID", "math12")], 
            m = 10)

a.out is the imputation object; now we need to run the model on each imputed dataset. To do this, we use the lapply function in R to repeat a function over list elements. This applies the function (which is the model specification) to each dataset (d) in the list and returns the results as a list of models.

 mods <- lapply(a.out$imputations,
           function(d) lmer(log(wage) ~ gender + age + age_sqr + 
            occupation + degree + private_sector + overtime + 
             (1+gender|faculty), data = d))
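The same lapply pattern can be tried without Amelia or lme4. The sketch below is an illustration only: it swaps in lm for lmer, and `imps` is a simulated stand-in for a.out$imputations (a named list of data frames), so none of the names or numbers come from the project above.

```r
set.seed(1)

# Hypothetical stand-in for a.out$imputations: a named list of data frames
imps <- lapply(1:3, function(i) {
  x <- rnorm(50)
  data.frame(x = x, y = 1 + 2 * x + rnorm(50))
})
names(imps) <- paste0("imp", 1:3)

# Fit the same model to every dataset; the result is a list of fitted models
fits <- lapply(imps, function(d) lm(y ~ x, data = d))

# Pull one coefficient out of each fit; the list names carry over
slopes <- sapply(fits, function(f) coef(f)[["x"]])
print(slopes)
```

Because the result is an ordinary list, any extractor function can be mapped over it in the same way.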

Now we create a data.frame from that list by simulating the values of the fixed and random effects, using the functions FEsim and REsim from the merTools package.

imputeFEs <- ldply(mods, FEsim, nsims = 1000)
imputeREs <- ldply(mods, REsim, nsims = 1000)

The data.frames above include separate estimates for each dataset; now we need to collapse them across the imputed datasets.

imputeREs <- ddply(imputeREs, .(X1, X2), summarize, mean = mean(mean), 
               median = mean(median), sd = mean(sd), 
               level = level[1])

imputeFEs <- ddply(imputeFEs, .(var), summarize, meanEff = mean(meanEff), 
               medEff = mean(medEff), sdEff = mean(sdEff))
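Note that averaging the standard deviations across imputations, as above, leaves out the between-imputation variability. As an alternative (not the method used in this answer), the sketch below collapses a hypothetical table of per-imputation estimates with Rubin's rules in base R; the data frame `fixEst` and all of its values are made up for illustration.

```r
# Hypothetical per-imputation output: one row per term per imputed dataset
fixEst <- data.frame(
  imp       = rep(1:3, each = 2),
  term      = rep(c("(Intercept)", "x"), times = 3),
  estimate  = c(1.0, 2.1, 1.2, 1.9, 0.9, 2.0),
  std.error = c(0.20, 0.10, 0.22, 0.12, 0.18, 0.11)
)

# Collapse by term: simple mean of the SEs next to the Rubin's-rules SE
pooled <- do.call(rbind, lapply(split(fixEst, fixEst$term), function(d) {
  m     <- nrow(d)                                   # number of imputations
  total <- mean(d$std.error^2) + (1 + 1 / m) * var(d$estimate)
  data.frame(term     = d$term[1],
             estimate = mean(d$estimate),
             mean_se  = mean(d$std.error),           # ignores between variance
             rubin_se = sqrt(total))                 # within + between
}))
print(pooled)
```

The Rubin's-rules standard error is never smaller than the simple mean of the standard errors, which is why the simple mean tends to look too optimistic.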

Now we can also extract some statistics on the variance/covariance for the random effects across the imputed values. Here I have written two simple extractor functions to do this.

REsdExtract <- function(model){
  out <- unlist(lapply(VarCorr(model), attr, "stddev"))
  return(out)
}

REcorrExtract <- function(model){
  out <- unlist(lapply(VarCorr(model), attr, "corre"))
  return(min(unique(out)))
}

And now we can apply them to the models and store them as a vector:

modStats <- cbind(ldply(mods, REsdExtract), ldply(mods, REcorrExtract))

Update

The functions below will get you much closer to the output provided by arm::display by operating on the list of lmer or glmer objects. Hopefully this will be incorporated into the merTools package in the near future:

# Functions to extract standard deviation of random effects from model
REsdExtract <- function(model){
  out <- unlist(lapply(VarCorr(model), attr, "stddev"))
  return(out)
}

#slope intercept correlation from model
REcorrExtract <- function(model){
  out <- unlist(lapply(VarCorr(model), attr, "corre"))
  return(min(unique(out)))
}

modelRandEffStats <- function(modList){
  SDs <- ldply(modList, REsdExtract)
  corrs <- ldply(modList, REcorrExtract)
  tmp <- cbind(SDs, corrs)
  names(tmp) <- c("Imp", "Int", "Slope", "id", "Corr")
  out <- data.frame(IntSD_mean = mean(tmp$Int), 
                        SlopeSD_mean = mean(tmp$Slope), 
                    Corr_mean = mean(tmp$Corr), 
                        IntSD_sd = sd(tmp$Int),
                    SlopeSD_sd = sd(tmp$Slope), 
                        Corr_sd = sd(tmp$Corr))
  return(out)
}

modelFixedEff <- function(modList){
  # tidy() methods for lmer/glmer fits moved to broom.mixed in newer broom versions
  require(broom)
  fixEst <- ldply(modList, tidy, effects = "fixed")
  # Collapse
  out <- ddply(fixEst, .(term), summarize,
               estimate = mean(estimate), 
               std.error = mean(std.error))
  out$statistic <- out$estimate / out$std.error
  return(out)
}

print.merModList <- function(modList, digits = 3){
  len <- length(modList)
  form <- modList[[1]]@call
  print(form)
  cat("\nFixed Effects:\n")
  fedat <- modelFixedEff(modList)
  dimnames(fedat)[[1]] <- fedat$term
  # pfround() and fround() come from the arm package
  pfround(fedat[-1, -1], digits)
  cat("\nError Terms Random Effect Std. Devs\n")
  cat("and covariances:\n")
  cat("\n")
  ngrps <- length(VarCorr(modList[[1]]))
  errorList <- vector(mode = 'list', length = ngrps)
  corrList <- vector(mode = 'list', length = ngrps)
  for(i in 1:ngrps){
    subList <- lapply(modList, function(x) VarCorr(x)[[i]])
    subList <- apply(simplify2array(subList), 1:2, mean)
    errorList[[i]] <- subList
    subList <- lapply(modList, function(x) attr(VarCorr(x)[[i]], "corre"))
    subList <- min(unique(apply(simplify2array(subList), 1:2, function(x) mean(x))))
    corrList[[i]] <- subList
  }
  errorList <- lapply(errorList, function(x) {
    diag(x) <- sqrt(diag(x))
    return(x)
    })

  lapply(errorList, pfround, digits)
  cat("\nError Term Correlations:\n")
  lapply(corrList, pfround, digits)
  residError <- mean(unlist(lapply(modList, function(x) attr(VarCorr(x), "sc"))))
  cat("\nResidual Error =", fround(residError,
                                             digits), "\n")
  cat("\n---Groups\n")
  ngrps <- lapply(modList[[1]]@flist, function(x) length(levels(x)))
  modn <- getME(modList[[1]], "devcomp")$dims["n"]
  cat(sprintf("number of obs: %d, groups: ", modn))
  cat(paste(paste(names(ngrps), ngrps, sep = ", "),
            collapse = "; "))
  cat("\n")
  cat("\nModel Fit Stats")
  mAIC <- mean(unlist(lapply(modList, AIC)))
  cat(sprintf("\nAIC = %g", round(mAIC, 1)))
  moDsigma.hat <- mean(unlist(lapply(modList, sigma)))
  cat("\nOverdispersion parameter =", fround(moDsigma.hat,
                                             digits), "\n")
}
