Solved – Multiple Imputations and Survival Analysis

multiple-imputation, r, survival

I’m new to multiple imputation and would like an opinion on using it with survival analysis in R. I am running MICE on the entire dataset. For one of my independent variables, I set to NA the observations that diagnostics flagged as outliers influencing my results. I decided to impute because a different independent variable has a large number of missing observations. Is it appropriate to impute data that I coded as NA because they were influential outliers?

I am estimating a Weibull AFT survival model. How do you derive model fits (e.g. pseudo R-squared, log-likelihood, AIC) with pooled imputation data? Finally, how do you pull the scale parameter?

BTW, this is the code I am using to pull my pooled results

summary(pool(fitm))

Best Answer

There are two issues here: the handling of "outliers" and what may be a misunderstanding of how to use multiple imputation.

As someone who regularly does survival analysis in a biomedical setting, I completely agree with @DWin that you should not remove outliers. The exception is when you know there were errors in collecting those data points, in which case they should not have been entered into the analysis in the first place; and if you do remove data points on the basis of collection errors, examine all the data points, not just the ones that look like outliers. The reasons those observations appear to be outliers might be crucial to understanding the underlying problem, so you may be doing yourself a major disservice if you throw them away just because a particular algorithm deemed them "outliers." It's even possible that they will no longer appear as outliers in analyses of the multiply imputed data.

Second, when you ask "How do you derive model fits (e.g. pseudo R-squared, log-likelihood, AIC) with pooled imputation data?" it seems that you are trying to fit a single model to a pool of your multiple imputations. Instead, you fit your model separately to each of the multiply imputed data sets and then pool the results of those fits; the differences in results among the imputations indicate the variation due to the imputation process. The rms package in R has facilities for handling survival analysis of multiply imputed data sets.
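To make the fit-then-pool workflow concrete, here is a minimal sketch rather than a definitive recipe. It assumes the `lung` data shipped with the survival package as stand-in data, an illustrative choice of covariates, and m = 5 imputations; none of these come from the question. Rubin's rules are written out by hand so that each step is visible. Log(scale) is pooled on the log scale, where `survreg` estimates it, and then back-transformed. The imputation model is deliberately simple (it just uses the time and event columns as predictors).

```r
library(survival)  # survreg(), Surv(), and the lung example data
library(mice)      # mice(), complete()

# Assumed example data: lung has missing values in ph.karno and wt.loss.
dat <- lung[, c("time", "status", "age", "sex", "ph.karno", "wt.loss")]
dat$event <- as.numeric(dat$status == 2)  # status: 1 = censored, 2 = dead
dat$status <- NULL

m <- 5
imp <- mice(dat, m = m, printFlag = FALSE, seed = 1)

# Fit the same Weibull AFT model separately to each completed data set.
fits <- lapply(seq_len(m), function(i) {
  survreg(Surv(time, event) ~ age + sex + ph.karno + wt.loss,
          data = mice::complete(imp, i), dist = "weibull")
})

# One column per imputation: regression coefficients followed by log(scale).
est <- sapply(fits, function(f) c(coef(f), "Log(scale)" = log(f$scale)))
# Matching sampling variances; vcov() of a Weibull survreg fit has the
# log(scale) term as its last row and column.
wvar <- sapply(fits, function(f) unname(diag(vcov(f))))

# Rubin's rules.
qbar <- rowMeans(est)             # pooled point estimates
ubar <- rowMeans(wvar)            # within-imputation variance
b    <- apply(est, 1, var)        # between-imputation variance
tvar <- ubar + (1 + 1 / m) * b    # total variance

pooled <- data.frame(estimate = qbar, se = sqrt(tvar))
print(pooled)

# Pooled Weibull scale parameter, back-transformed from the log scale.
pooled_scale <- exp(qbar[["Log(scale)"]])
print(pooled_scale)

# Log-likelihood and AIC exist only per fitted model, so they are listed
# per imputation here rather than combined with Rubin's rules.
fit_stats <- data.frame(
  imputation = seq_len(m),
  logLik     = sapply(fits, function(f) as.numeric(logLik(f))),
  AIC        = sapply(fits, AIC)
)
print(fit_stats)
```

If `fitm` in the question was created with `with(imp, survreg(...))`, then `summary(pool(fitm))` should correspond to the `pooled` table above. Statistics such as the log-likelihood and AIC are not estimates of a parameter, so the sketch reports them per imputation instead of averaging them into a single "pooled" value.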

As with any multiple imputation, make sure you are confident that the missing data are truly "missing at random" in the technical sense used for this approach: that, given the data you did observe, the probability of an observation's being missing does not depend on its actual value. Were you to mark "outliers" as NA and then impute their values, you would be directly violating that requirement, because those values would be missing precisely because of what they are. So you can't proceed in that way, even if you are still convinced that the data points are "outliers."
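A small simulation can illustrate why imputing self-inflicted NAs is a problem. This is a sketch on made-up data: the variables `x_true` and `z`, the 95th-percentile cutoff, and the use of predictive mean matching (`method = "pmm"`) are all assumptions for illustration. Because pmm fills each gap with a value borrowed from an observed case, the imputed values can never be as large as the values that were deleted.

```r
library(mice)

set.seed(42)
n <- 500
z <- rnorm(n)                    # fully observed auxiliary variable
x_true <- z + rnorm(n)           # covariate with some genuinely large values
cutoff <- quantile(x_true, 0.95)

# Mark the largest values as NA: missingness now depends on x itself.
x_obs <- x_true
x_obs[x_true > cutoff] <- NA

imp <- mice(data.frame(x = x_obs, z = z), m = 5, method = "pmm",
            printFlag = FALSE, seed = 42)

# Completed x from the first imputation, and just the imputed entries.
x_completed <- mice::complete(imp, 1)$x
imputed <- x_completed[is.na(x_obs)]

# Every imputed value is an observed donor value, so none exceeds the cutoff.
print(range(imputed))
print(c(true_mean = mean(x_true), completed_mean = mean(x_completed)))
```

In this setup the completed-data mean is necessarily below the true mean, since every deleted value was above the cutoff and every replacement is at or below it. Other imputation methods are not bounded in the same way, but they still estimate their model from data in which the large values have been selectively removed.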