I have read and watched several tutorials about MICE. My confusion is about step 1: creating several copies of the original dataset and imputing different values in each copy. In some tutorials, I have seen random subsamples of the original datasets instead of multiple copies of the whole dataset. I am not sure if these two are talking about the same thing or not.
If the first statement is the correct one, does it mean that we impute one VARIABLE in each of these copies?
If the second statement is correct, how do we choose these random subsamples?
Multiple Imputation – How to Choose MICE Multiple Datasets
data-imputation mice multiple-imputation
Best Answer
The best freely available resource on this topic is probably Stef van Buuren's Flexible Imputation of Missing Data (FIMD). "MICE" stands for "multivariate imputation by chained equations," one particular way to do multiple imputation. It might help to keep that distinction in mind.
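Each copy of a multiply imputed data set includes imputations of all missing data points, in every variable that has missing values, not of a single variable per copy. The imputations incorporate randomness to reflect the uncertainty about the missing values, so the imputed values differ from copy to copy while the observed values stay the same. Each copy is thus a full data set without missing values. Differences in model results among the imputed data sets are used to estimate the extra uncertainty introduced by imputation.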
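As a rough illustration of how those differences among imputed data sets are turned into an error estimate, here is a minimal sketch of Rubin's rules for a single scalar estimate (say, one regression coefficient). The function name `pool_rubin` and the numbers are made up for illustration; in practice the estimates and their squared standard errors would come from fitting the same model to each imputed data set.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one scalar estimate across m imputed data sets (Rubin's rules).

    estimates: the point estimate from each imputed data set
    variances: the squared standard error from each imputed data set
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and matching variances")
    q_bar = mean(estimates)       # pooled point estimate
    u_bar = mean(variances)       # within-imputation variance
    b = variance(estimates)       # between-imputation variance (divisor m - 1)
    t = u_bar + (1 + 1 / m) * b   # total variance of the pooled estimate
    return q_bar, u_bar, b, t


# Hypothetical results from m = 5 imputed data sets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]

q_bar, u_bar, b, t = pool_rubin(estimates, variances)
print(f"pooled estimate = {q_bar:.3f}, total variance = {t:.3f}")
```

The between-imputation term `b` is the part of the variance that exists only because values were missing; if every imputed data set gave the same estimate, it would be zero.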
It's not clear from the question just what you mean by "random subsamples of the original datasets." There is random sampling involved in the MICE algorithm (Section 4.5 of FIMD). That, however, is sampling from the conditional distributions of variables among each other within the data set, not subsampling of cases from data sets. Some analysis methods use random subsamples of cases from data sets for cross validation or bootstrapping, but subsampling isn't an imputation method.
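To make the contrast with subsampling concrete, here is a toy sketch with one complete variable `x` and one incomplete variable `y`. It is a deliberate simplification, not the MICE algorithm: it fits a single regression to the observed cases and draws each missing `y` from the fitted conditional distribution, whereas proper multiple imputation would also draw the regression parameters and MICE would cycle through every incomplete variable. The names `fit_line` and `impute_copies` and the data are invented for illustration.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope, intercept and residual SD for y ~ x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return slope, intercept, sd


def impute_copies(rows, m, seed=0):
    """Return m completed copies of rows; each row is (x, y), y may be None."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in rows if y is not None]
    slope, intercept, sd = fit_line(
        [x for x, _ in observed], [y for _, y in observed]
    )
    copies = []
    for _ in range(m):
        # Every copy keeps ALL rows; only the missing y values are drawn anew.
        completed = [
            (x, y if y is not None else rng.gauss(intercept + slope * x, sd))
            for x, y in rows
        ]
        copies.append(completed)
    return copies


# Hypothetical data: y is missing in the third and fifth rows.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, None), (4.0, 8.2), (5.0, None), (6.0, 11.8)]
copies = impute_copies(rows, m=5)

for i, copy in enumerate(copies, start=1):
    print(i, [round(y, 2) for _, y in copy])
```

Each of the five copies has all six rows, the observed values are identical across copies, and only the two imputed values vary. No cases are dropped or resampled at any point.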
So there really isn't a question of which imputed data sets to "choose": you analyze all of them and pool the results. What is critical is to make good choices about how to do the imputations and how many imputed data sets to generate.