Missing Data – Understanding Missing Rates and Multiple Imputation Techniques

data-imputation, missing data

Is there a maximum acceptable rate of missingness when using multiple imputation (MI)?

For example, can I use MI if one variable is missing in 20% of cases while other variables also have missing values, but at lower rates?

Best Answer

From the comments, you're confident that you're in a missing at random (MAR) or missing completely at random (MCAR) situation. Then multiple imputation is at least reasonable. So how much missingness is tractable? Think of it this way:

Basically, multiple imputation makes all your model parameter estimates less certain. How much less certain depends on how accurately your imputation model can predict the missing data, on how much data needs imputing, and on the number of imputations you use, among other things.

How much is 'too much' missingness therefore depends on how much added variance/uncertainty you are willing to put up with. A useful quantity for you might be the relative efficiency ($RE$) of an MI analysis. This depends on the 'fraction of missing information' (not the simple rate of missingness), usually called $\lambda$, and the number of imputations, usually called $m$, as $RE \approx 1/(1+\lambda/m)$.
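To get a feel for the numbers, here is a minimal sketch that just evaluates the approximation above over a small grid. The function name `relative_efficiency` and the grid values are my own choices for illustration, and the efficiency is measured relative to using infinitely many imputations.

```python
def relative_efficiency(lam: float, m: int) -> float:
    """Approximate relative efficiency RE = 1 / (1 + lam / m).

    lam: fraction of missing information, between 0 and 1
         (not the raw proportion of missing cells).
    m:   number of imputations.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if m < 1:
        raise ValueError("m must be at least 1")
    return 1.0 / (1.0 + lam / m)


if __name__ == "__main__":
    # Illustrative grid only; these values are not taken from any dataset.
    for lam in (0.1, 0.2, 0.5):
        for m in (3, 5, 20):
            re = relative_efficiency(lam, m)
            print(f"lambda={lam:.1f}  m={m:2d}  RE={re:.3f}")
```

With $\lambda = 0.2$ and $m = 5$ this gives $RE \approx 0.96$, so even a handful of imputations loses little efficiency unless the fraction of missing information is large.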

Rather than reproduce the definitions of missing information etc. here, you might simply read the MI FAQ, which puts things very clearly. From there you'll know whether you want to tackle the original sources: Rubin etc.

Practically speaking, you should probably just try an imputation analysis and see how it works out.
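Once you have run the analysis on each imputed dataset, you can pool the results and see how much of the final variance is due to missingness. The following is a rough sketch using Rubin's combining rules for a single scalar parameter. The function name `pool_rubin` and the input numbers are made up for illustration, and the $\lambda$ it returns is the simple large-sample version, without the degrees-of-freedom correction that some software applies.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one scalar parameter across m imputed datasets (Rubin's rules).

    estimates: point estimates, one per imputed dataset.
    variances: squared standard errors, one per imputed dataset.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least 2 imputations and matching lengths")

    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between

    # Share of the total variance attributable to the missing data
    # (large-sample approximation to the fraction of missing information).
    lam = (1 + 1 / m) * between / total
    return {"estimate": q_bar, "total_variance": total, "lambda": lam}


if __name__ == "__main__":
    # Hypothetical results from m = 5 imputed datasets, not real data.
    est = [1.0, 1.2, 0.8, 1.1, 0.9]
    var = [0.04, 0.04, 0.04, 0.04, 0.04]
    print(pool_rubin(est, var))
```

If the estimated $\lambda$ comes out large, the estimates vary a lot between imputations, which is a sign that you need more imputations or a better imputation model.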
