Missing Data – What to Report in a Paper on Multiple Imputation

missing-data multiple-imputation reporting

I'm wondering which results have to be reported in a paper when multiple imputation (MI) has been performed: the estimates (confidence intervals (CI), P-values) from the complete-case (CC) analysis or those from the MI? I couldn't find this in the awesome books by Enders and van Buuren, although they do give guidelines on how to report the MI procedure. (If I missed it, then I apologize.)

I looked at some articles from the review by Rezvan (2015), The rise of multiple imputation. Some of them reported the MI estimates, others the CC estimates, and for others it is not clear which were reported.

I concluded for myself that the MI estimates (odds ratios, CIs, P-values) should be reported, for the simple reason that I want unbiased estimates, as long as MI is appropriate. But what about baseline figures and contingency tables?

Here are the concrete questions again (assuming that MI is appropriate):

  1. Which results (odds ratio or mean, CI, P-values) have to be reported: the results from CC or the pooled results from MI?
  2. Let's assume we have a 2×4 contingency table (4 levels: yes, no, don't understand, unsure) and perform MI, obtaining 10 imputed datasets. We can compute a pooled P-value using the formula for pooling $\chi^2$-tests (for example: van Buuren, page 159). I want to report the percentage and the absolute number (in brackets) of "yes" answers, for example 15% (20). Which should I report: the figures from the complete cases, the mean of the 10 "yes" percentages and counts (bearing in mind that the levels should add up to 100%, and the counts to the full sample size with nothing missing), or the figures from just one randomly chosen imputed dataset?
  3. Baseline analysis: should the complete-case dataset or the pooled MI datasets be used?

Best Answer

In general, it is appropriate to report the results of the planned primary analysis. Depending on space, you might also report all or some of the foreseen sensitivity/supportive analyses, and potentially additional analyses requested by, for example, peer reviewers (if the pre-specified analysis were a complete-case analysis, I would as a reviewer ask for some more appropriate analysis to be reported as well). The results of the MI analysis (estimates, CIs etc. obtained by aggregating the analyses of each imputed dataset) are indeed the logical thing to report if this is the pre-specified analysis.
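To make "aggregating the analyses of each imputation" concrete, here is a minimal sketch of Rubin's rules for a single scalar estimate. The function name `pool_rubin` and its inputs are my own illustration, not taken from any particular package; it assumes you already have one point estimate and one squared standard error per imputed dataset, and that the estimate is on a scale where a normal approximation is reasonable.

```python
from math import inf, sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation results for one scalar parameter (hypothetical helper).

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled estimate, pooled standard error, Rubin's degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = mean(estimates)      # pooled point estimate
    within = mean(variances)     # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between

    if between == 0:
        df = inf                 # no between-imputation variability
    else:
        # r is the relative increase in variance due to missingness
        r = (1 + 1 / m) * between / within
        df = (m - 1) * (1 + 1 / r) ** 2

    return q_bar, sqrt(total), df
```

A confidence interval would then be the pooled estimate plus or minus a t quantile (with the returned degrees of freedom) times the pooled standard error. Odds ratios are usually pooled on the log scale and exponentiated afterwards.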

Another question is what else to report. I would certainly expect the multiple imputation approach to be described somewhere in the methods: which variables were entered, whether the imputation model was fitted longitudinally for each time point or jointly across all time points under some joint normality assumption, how many imputations were used, and so on. Multiple imputation comes in many flavors and variants, and it is important for the reader to be able to find out what was done.

For contingency tables or baseline characteristics, to me the main question is whether you are primarily trying to describe the data or whether you see the table as something that readers will compare and draw some kind of mental inference from. Both have some value. For the first, it may be most transparent to report the number of missing or non-missing values in addition to summary statistics of the complete cases (that is certainly very common, especially for baseline characteristics). As soon as the table has more of a "let's compare these between groups" feeling, imputed results may be more appropriate. In either case, one should be transparent about what is being reported. In the contingency table example you mention, the average percentages across all the imputations could be one thing to report.
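As a rough sketch of what "average percentages across all the imputations" could look like for one categorical variable: the helper `mean_percentages` and the toy data are hypothetical, and it assumes each imputed dataset is supplied as a fully completed list of category labels. Because the percentages within each imputed dataset add up to 100, the averaged percentages do too.

```python
from collections import Counter


def mean_percentages(imputed_columns, levels):
    """Average the per-level percentages over imputed datasets (hypothetical helper).

    imputed_columns: one list of category labels per imputed dataset
    levels: all category labels that should appear in the table
    """
    m = len(imputed_columns)
    totals = {level: 0.0 for level in levels}
    for column in imputed_columns:
        counts = Counter(column)
        n = len(column)
        for level in levels:
            # percentage of this level within a single imputed dataset
            totals[level] += 100.0 * counts[level] / n
    return {level: totals[level] / m for level in levels}


# Toy example: two imputed versions of the same four-person variable.
levels = ["yes", "no", "don't understand", "unsure"]
imputed = [
    ["yes", "no", "no", "unsure"],
    ["yes", "yes", "no", "unsure"],
]
table = mean_percentages(imputed, levels)
```

Averaged counts are generally not integers, so if a count is shown next to the percentage it is worth stating in the table footnote how it was obtained.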

By the way, 10 imputations is a really low number. It may be enough to ensure type I error control, but with a much larger number you avoid the results depending too much on the pseudorandom number seed you specify, and you usually gain a bit of power. I tend to go for something like 250 to 1000 by default, provided it is not computationally too expensive and there is up to a low double-digit percentage of missing data across time points.
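Two standard quantities give a feel for how much the number of imputations matters; this is only a sketch, and the function names are mine. The first is the Monte Carlo standard error of the pooled estimate, i.e. how much it would move under a different seed, computed as the square root of the between-imputation variance divided by the number of imputations. The second is Rubin's relative efficiency of m imputations compared with infinitely many, which needs an assumed fraction of missing information.

```python
from math import sqrt
from statistics import variance


def monte_carlo_se(estimates):
    """Simulation standard error of the pooled estimate: sqrt(B / m)."""
    m = len(estimates)
    return sqrt(variance(estimates) / m)


def relative_efficiency(fmi, m):
    """Rubin's relative efficiency (variance scale) of m imputations.

    fmi: assumed fraction of missing information, between 0 and 1
    m:   number of imputations
    """
    return 1.0 / (1.0 + fmi / m)
```

The efficiency figure approaches 1 quickly as m grows, so most of the benefit of a large m shows up as reproducibility (a smaller Monte Carlo standard error) rather than as a large gain in precision.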