Solved – How to select the best dataset after multiple imputation in MICE to build other models

Tags: mice, missing-data, r

I carried out multiple imputation using MICE with m=10. The R code is shown below:

RainfallData <- mice(rainfall,m=10,maxit=10,meth='pmm')

modelFit1 <- with(RainfallData,lm(Total.Rainfall~Wind.Direction+Hor.Windspeed+Solar.Radiation+Baro.Pressure+Vpr.Pressure+Rel.Humidity+Air.Temp))

pool(modelFit1)

summary(pool(modelFit1))

completedData <- complete(RainfallData,action = "long")

My question is: how should I select the best completed dataset out of the 10 (m=10), i.e. the one that provides the best estimates of the missing values? I need to use this dataset for further analysis.

Should I take the averages of the values from the 10 completed datasets and build one complete dataset? Or should I just randomly select one of the 10?

In my case, only 2.8% of the data are missing for each variable. I could consider complete case analysis, but I would like to fit a time series model and so would like to fill in the missing values. Both the dependent and independent variables have missing data. The missing data are MCAR.

Please help me. I am really confused.

Best Answer

You should fit your model to each of the multiple imputations and then combine the results (e.g. using Rubin's rules). That way the uncertainty about your final analysis result does not come just from the sampling variability within each fitted model, but also from how much the results differ across the imputed datasets. That appropriately reflects the uncertainty about what the missing data might have been.
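To make the combining step concrete, here is a minimal sketch of Rubin's rules for a single coefficient, using made-up per-imputation estimates and standard errors (the numbers are purely illustrative; in practice `pool()` from mice does this for you across all coefficients):

```r
# Hypothetical results for one coefficient from m = 5 imputed datasets
# (illustrative numbers only, not from the rainfall data)
q <- c(1.10, 1.05, 0.98, 1.12, 1.02)   # point estimates, one per imputation
u <- c(0.04, 0.05, 0.04, 0.05, 0.04)   # squared standard errors (within-imputation variances)

m     <- length(q)
q_bar <- mean(q)                 # pooled estimate: average over imputations
w_bar <- mean(u)                 # average within-imputation variance
b     <- var(q)                  # between-imputation variance
t_var <- w_bar + (1 + 1/m) * b   # total variance by Rubin's rules
pooled_se <- sqrt(t_var)
```

Note that `t_var` is always larger than `w_bar`: the between-imputation term is exactly the extra uncertainty from the missing data that you would throw away by keeping only one imputed dataset.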

If you base your results on fewer than 3-5 imputations (e.g. by using just one imputation), you get none of the nice properties of MI: your standard errors will be too small and you get Type I error inflation. If you pick one imputation based on some model fit statistic, I would expect this to be even worse.

10 imputations is a relatively low number; if it does not take too long, I would normally do at least 250 or so. Doing so often makes your standard errors a little smaller and makes the results less dependent on your random number seed.