Solved – Mean comparisons following multiple imputation

f-testmultiple-imputationrregressiont-test

I need to do some simple mean comparisons between groups (basic ANOVA F-tests) on data with missing values. I use the mice package in R for multiple imputation, but I can only pool results for the linear model coefficients, or the $R^2$.

Does anyone know how to combine to pool multiple F-statistics from each linear model fit? Or, how can I compute the standard errors for the F-test?

Best Answer

A recent paper by van Ginkel & Kroonenberg works out the details of pooling F-tests and other ANOVA results. The paper is:

van Ginkel, J. R., & Kroonenberg, P. M. (2014). Analysis of Variance of Multiply Imputed Data. Multivariate Behavioral Research, 49(1), 78-91.

and van Ginkel's website (http://www.socialsciences.leiden.edu/educationandchildstudies/childandfamilystudies/organisation/staffcfs/van-ginkel.html) has SPSS macros with instruction files. As far as I know, their formulae have not yet been implemented in R.

@Brian, if you do write a function, please share!

Related Solutions

Solved – How to summarize GAM model result from multiple imputation data in R

First, GAM's are basically a combination of GLMs, splines, and ridge regression (loosely speaking). You might be well advised to work on your understanding of these three things before attempting to work with their combination.

Second, I've done multiple imputation with GAMs, but not with mice.

In general, you're interested in fitting a model: $$ y = f(\mathbf{X}) + \epsilon $$ (in the gaussian, identity-linked case). $\mathbf{X}$ has holes in it -- missing data. Without going into the detail about how multiple imputation works (Gelman and Hill have a very good and less-technical chapter explaining missing data imputation), what it does is create many different $\mathbf{X}$'s, call them $\mathbf{X}_m$ where $m$ indexes imputed datasets from 1 to the number of imputations, with the missing values filled in with plausible guesses about what the missing values might be -- based on correlations in your dataset. You then fit your model $$ y = f(\mathbf{X}_m) + \epsilon $$ to each of those $m$ partially-imputed datasets. Since you're fitting $m$ models to $m$ different datasets, you're going to get $m$ different vectors of regression coefficients. These are combined according to "Rubin's Rules" (after the statistician who invented MI). Basically you average the coefficients; $$\hat\beta = \frac{1}{M}\displaystyle\sum_{m=1}^M \hat\beta_m$$ The variance-covariance matrix $\hat V_\beta$ of the estimated parameters is calculated by first averaging variance-covariance matrices, and then adding a correction to account for variation between imputation models: \begin{equation} \hat V_{\hat\beta} = W + \left(1 + \frac{1}{M} \right)B \end{equation} where $W = \frac{1}{M}\displaystyle\sum_{m=1}^M \widehat{VCV}_m$, $\widehat{VCV}$ is the estimated variance-covariance matrix of the estimated parameters, and $B = \frac{1}{M-1}\displaystyle\sum_{m=1}^M \left(\hat\beta_m - \hat{\bar\beta}\right)\left(\hat\beta_m - \hat{\bar\beta}\right)^T$. This procedure inflates the standard errors on coefficients about which the imputation model is relatively less certain, either due to a lot of missing data, or due to a poorly-informative imputation model.

You can then use the vector of coefficients and the VCV as if they were gotten from a single complete-case model.

OP's question in a comment makes me realize that the below isn't quite right! Read EDIT2 below to see why

So what is different about doing this with a GAM? Only a couple of things. Your variables represented by smooth functions are associated with a number of coefficients (mgcv's default is 10). During estimation they are subject to ridge penalties, which smooths the estimated function. But otherwise they can be slotted into Rubin's rules just like the parametric coefficients. Each of the different $m$ models will have different estimated smoothing parameters, as they are estimated from the data. I don't think there is a problem with this, though I'd be interested to hear if someone has another perspective. You do want to take care that you pre-specify the location of the knots of each spline so that they are uniform across the $m$ models -- otherwise your coefficients won't be comparable, and combining them won't be appropriate.

Once you have calculated the combined coefficients and VCV, you can simply stuff them into a GAM object (one of your $m$ models). Functions that summarize data and make plots draw from those two objects, and will thereby use your imputation estimates rather than those from the $m$th model. There is code online illustrating the whole process from this paper.

EDIT I just saw the comment about how you've got your models fitted to your imputed datasets already. If you can coerce those into a list object with the 5 GAM models, then you can run something like the following to combine them:

bhat=results[[1]]$coeff
    for (i in 2:reps){
    	bhat=bhat+results[[i]]$coeff
    }
bhat = bhat/reps

W=results[[1]]$Vp
    for (i in 2:reps){
    	W = W+results[[i]]$Vp
    	}
    W = W/reps
    B= (results[[1]]$coeff-bhat) %*% t(results[[1]]$coeff-bhat)
    for (i in 2:reps){
    	B = B+(results[[i]]$coeff-bhat) %*% t(results[[i]]$coeff-bhat)
    }
B=B/(reps-1)

Vb = W+(1+1/reps)*B

dfr=results[[1]]$df.residual
    for (i in 2:reps){
    	dfr=dfr+results[[i]]$df.residual
    }
dfr = dfr/reps

MI = results[[1]]

MI$coefficients=bhat
    MI$Vp = Vb
MI$df.residual = dfr

EDIT2 OP's question in a comment about p-values makes me realize that the above isn't quite right. It should get you appropriate coefficient and VCV estimates. But p-values are based on a reduced-rank Wald statistic that relies on the model matrix for its calculation. Obviously the model matrix will differ between different imputations. So if you take a single one of many gam objects (which will all have their own model frames) and stuff the VCV and coefficient vectors into it, you won't get the same result as you would if you chose a different one of your models fit to a different imputed dataset. I'm not sure how Rubin's Rules would be generalized to combine imputations here! Maybe a short-term hack would be to also average the model R matrices from each model? This should be about as valid as averaging the effective degrees of freedom (which may also be a hack).

Solved – perform Random Forest AFTER multiple imputation with MICE

This is not a direct answer to your question, and I don't have enough reputation to comment, but one thing you can do is use the Machine Learning in R package. There are many random forest learner implementations there that can use data with missing values. You can also tune the learners based on what your dataset is.

Links to the package and documentation are on the main tutorial page, here:

https://mlr-org.github.io/mlr-tutorial/release/html/index.html

Also, consider that answering your question becomes much easier if you provide a sample of your dataset.

If you need a direct answer, looping a series of RF calls on the imputed datasets might work. E.g. if you have five imputations:

res = data.frame(matrix(0,nrow=nrow(test),ncol=5)
for (i in 1:5){
  data = complete(miceResult, 1)
  rf.res = cforest(data,formula ~ [which formula?])
  res[,i] = predict(rf.res, test)
}

Then you can pool the results by majority voting or averaging, depending on your dataset. You can also group the 5 imputations together and train the learner with the combined dataset. Both methods are suboptimal, however.

Hope this helps.

Best Answer

Related Solutions

Solved – How to summarize GAM model result from multiple imputation data in R

Solved – perform Random Forest AFTER multiple imputation with MICE

Related Question