Multiple Imputation and Regression Model Diagnostics


When I run a regression analysis, I find it important to run some model diagnostics, such as checking for outliers, influential observations, and multicollinearity (much like the examples at http://www.statmethods.net/stats/rdiagnostics.html).

Examples of the diagnostics I use:

#Load the car package, which provides dwt() and vif()
library(car)

#Assessing the assumption of independence, using the Durbin-Watson test
dwt(lmModel)

#Checking for multicollinearity
vif(lmModel)        #variance inflation factors
1/vif(lmModel)      #tolerance
mean(vif(lmModel))  #average VIF

I have a sample with a lot of missing data across most variables, so I need to use multiple imputation.

However, model diagnostics seem impossible to explore when using multiple imputation. So far I have used the mice package, and since I am still a novice at R, my multiple imputation script basically looks like this:

#Load the mice package
library(mice)

#Impute 5 datasets
imp <- mice(myData, m = 5)

#Run the regression analysis on each imputed dataset
fit <- with(imp, lm(A ~ B + C))

#Pool the results using Rubin's rules
pooled <- pool(fit)
summary(pooled)

Is there some way to run these diagnostic tests on the pooled results? Do I have to run them on each imputed dataset separately (before pooling)? Or is there some other smart way of solving this issue?

Thanks for your time

Best Answer

The data themselves are usually not pooled when MI is performed, at least not under the paradigm described by Rubin (1987). Rather, what is pooled are the parameter estimates obtained from each imputed dataset.

There are several ways to approach your question:

  1. You could perform regression diagnostics on your original, incomplete dataset, depending on what you wish to examine. For example, identifying outliers may well be possible using the incomplete data alone.

  2. You may also perform regression diagnostics on each of the imputed datasets separately, that is, calculate the VIF individually for each dataset (see the sketch after this list). I would advise against pooling these values, but inspecting them side by side should give you a good idea of whether multicollinearity is a problem.

  3. You may ignore the problem. This sounds a bit silly, but it is not unjustified, considering that the actual consequences of multicollinearity are still a topic of discussion. The main effect of multicollinearity in complete-data regression analyses is the inflation of standard errors. In turn, those complete-data standard errors are one component of the variance of the MI estimate: under Rubin's rules the total variance is $T = \bar{W} + (1 + 1/m)B$, where $\bar{W}$ is the average within-imputation variance and $B$ the between-imputation variance. Thus, if your complete-data estimates suffer from multicollinearity, your pooled estimates will too, making these situations not actually that different.
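Regarding point 2, here is a minimal sketch of how that could look in R, assuming imp is the mids object and A ~ B + C is the model from the question (the object names are illustrative):

library(mice)
library(car)

#Calculate the VIF separately on each completed (imputed) dataset
vifPerImputation <- lapply(seq_len(imp$m), function(i) {
  completedData <- complete(imp, i)  #extract the i-th imputed dataset
  fit.i <- lm(A ~ B + C, data = completedData)
  vif(fit.i)                         #variance inflation factors
})

#Inspect rather than pool: one row of VIFs per imputed dataset
do.call(rbind, vifPerImputation)

If the VIFs are large and similar across all imputed datasets, the multicollinearity sits in the data themselves rather than being an artifact of any single imputation.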

Another side effect of multicollinearity might be that the sampler (the Gibbs sampler implemented in mice) takes longer to converge or shows larger degrees of autocorrelation.

Unfortunately, the mice package is not well suited to monitoring the sampling of the parameters of the imputation model (the mice algorithm also makes that impractical, because there are many parameters). As an alternative, van Buuren suggests monitoring the means, standard deviations, or other descriptive measures of the imputed variables instead.
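A minimal sketch of that suggestion, reusing the imp object from the question (maxit = 20 is just an illustrative choice to get longer streams to inspect):

library(mice)

#Re-run the imputation with more iterations so the streams are easier to judge
imp <- mice(myData, m = 5, maxit = 20, printFlag = FALSE)

#plot() on a mids object draws the mean and standard deviation of the
#imputed values per variable against the iteration number; well-mixed,
#trend-free streams suggest convergence
plot(imp)

#The underlying chain statistics are also stored on the mids object
imp$chainMean  #means of the imputed values per variable/iteration/chain
imp$chainVar   #variances of the imputed values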

This is, however, a completely different question. You may find more on it online or in the package documentation; I also recommend the article by van Buuren and Groothuis-Oudshoorn in the Journal of Statistical Software (2011).
