Solved – Individual and overall RMSE for multivariate data

Tags: data-imputation, multivariate-analysis, rms, stratification

I have a dataset that contains missing values, and I'm using imputation packages (R's mi and mice) to fill them in. I'd like to measure their performance on my dataset, which may look like this (my actual data has more rows, but this example should serve just fine):

"var1"  "var2"  "var3"
0.183689466952776   0.388415623304919   -1.3390868493301
NA  -2.0495669969489    NA
NA  NA  1.1107054143715
1.29820089212697    0.736777347408364   -1.19623852541909
-1.17191167149872   -0.744790411450254  -1.96820040415179
-0.686058069998857  NA  NA
0.96219165971458    -1.26927815931595   1.13102353621198
NA  -0.181582994079309  -1.88246768436578
0.133837989978951   -0.298476696697043  0.887731971394049
0.42775517098228    -1.91391435336026   NA
NA  NA  0.0027473853587295
0.605986709105715   0.297153545105678   -1.03855048360928
NA  -1.18987831904712   1.0500895435177
NA  -0.219325915775778  1.54228872681253
NA  NA  -0.976306655339306
NA  NA  0.440861027292491
-1.92738847897133   -0.779770748497074  0.403377851347805
NA  -0.8839601961621    0.0382354592857369
-1.79066885776893   0.723084216521015   0.287610507512217
NA  -2.70392018097682   0.744853382274342

Note the different ratios of missing data (50%, 25%, 10%) and that (in this case, by construction) the pattern of missing data is random.

To measure the imputation error, I replace some of the non-NA values in each variable with NA. To keep the structure of the data the same, I opted to replace 10% of the non-NA entries in each variable with NA, i.e. 1, 2 and 2 entries in var1, var2 and var3, respectively. Other possible ways to create a test set would be to pick 10% of all non-missing values (regardless of which variable they belong to), or to pick the same absolute number in every variable, say 1 in every column.
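A minimal sketch of this per-variable masking, written in Python with NumPy rather than R purely for illustration. The helper name `mask_stratified`, the round-half-up rule and the fixed seeds are assumptions, not part of mi or mice, and the toy data only mimics the numbers of observed values in the example (10, 15 and 17).

```python
import numpy as np

def mask_stratified(data, frac=0.10, seed=0):
    """Hide a fraction of the observed entries in every column.

    Hypothetical helper: returns a masked copy of `data` and a boolean
    array that is True where an observed value was replaced by NaN.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    masked = data.copy()
    hidden = np.zeros(data.shape, dtype=bool)
    for j in range(data.shape[1]):
        observed = np.flatnonzero(~np.isnan(data[:, j]))
        # Round half up, so 15 observed values at 10% give 2 hidden cells.
        n_hide = int(np.floor(frac * observed.size + 0.5))
        rows = rng.choice(observed, size=n_hide, replace=False)
        masked[rows, j] = np.nan
        hidden[rows, j] = True
    return masked, hidden

# Toy data: 20 rows with 10, 5 and 3 missing values per column.
toy = np.random.default_rng(1).normal(size=(20, 3))
toy[:10, 0] = np.nan
toy[:5, 1] = np.nan
toy[:3, 2] = np.nan

masked, hidden = mask_stratified(toy)
print(hidden.sum(axis=0))  # number of held-out cells per column
```

The `hidden` array records which cells form the test set, so the original values can be compared with the imputed ones afterwards.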

I measure the quality of the imputation by calculating the RMSE of every column separately.
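As a rough sketch of the per-column calculation (Python with NumPy, for illustration only): the function name `rmse_per_column` and the tiny arrays are made up, and `hidden` is assumed to be a boolean array marking the cells that were deliberately set to NA.

```python
import numpy as np

def rmse_per_column(truth, imputed, hidden):
    """RMSE of each column, computed only over the held-out cells."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    hidden = np.asarray(hidden, dtype=bool)
    result = np.full(truth.shape[1], np.nan)  # NaN if a column has no test cells
    for j in range(truth.shape[1]):
        cells = hidden[:, j]
        if cells.any():
            err = imputed[cells, j] - truth[cells, j]
            result[j] = np.sqrt(np.mean(err ** 2))
    return result

# Made-up values: one held-out cell in column 0, two in column 1.
truth = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
imputed = np.array([[1.0, 13.0], [4.0, 20.0], [3.0, 26.0]])
hidden = np.array([[False, True], [True, False], [False, True]])

col_rmse = rmse_per_column(truth, imputed, hidden)
print(col_rmse)
```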

Is there a way to calculate an overall RMSE that takes into account that my test set was created in a stratified way?

Is it OK to calculate it as $\sqrt{\frac{RMSE_1^2 +RMSE_2^2 + RMSE_3^2 }{3}}$, or does it need to be weighted to reflect the different sample sizes used to calculate the individual RMSEs?
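To make the two candidates concrete, here is a small Python sketch of the arithmetic only; it does not settle which one is appropriate. The per-variable RMSE values are invented, the function name `pooled_rmse` is hypothetical, and the weights are the test-set sizes 1, 2 and 2 from the example.

```python
import numpy as np

def pooled_rmse(rmse, n=None):
    """Root of the (optionally weighted) mean of the squared RMSEs."""
    rmse = np.asarray(rmse, dtype=float)
    weights = np.ones_like(rmse) if n is None else np.asarray(n, dtype=float)
    return float(np.sqrt(np.sum(weights * rmse ** 2) / np.sum(weights)))

rmse = [0.5, 1.0, 2.0]  # made-up per-variable RMSEs
n_test = [1, 2, 2]      # held-out cells per variable, as in the example

unweighted = pooled_rmse(rmse)        # every variable counts equally
weighted = pooled_rmse(rmse, n_test)  # every held-out cell counts equally
print(unweighted, weighted)
```

Weighting by the number of held-out cells gives the same number as computing a single RMSE over all held-out cells pooled together, whereas the unweighted version treats each variable as one unit regardless of how many cells were tested.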

I found the idea of taking the root of the mean of the MSEs in "What is the RMSE of k-Fold Cross Validation?", but that answer only briefly explains why this is the right way to calculate the overall RMSE, and I'm also not sure whether this formula from CV applies to my case.

Best Answer

You can't sum mean squared errors like that unless your variables are all in the same unit, on the same scale. The unit of your RMSE is the square root of the sum of the squared units of its components. This is a completely meaningless unit in most practical applications I can think of.
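A small Python sketch of why the units matter, using invented errors for two hypothetical variables (a length and a mass): changing the unit of one variable changes the combined number and lets that variable dominate it.

```python
import numpy as np

def overall_rmse(errors_by_var):
    """Root of the unweighted mean of the per-variable MSEs."""
    mse = [np.mean(np.square(e)) for e in errors_by_var]
    return float(np.sqrt(np.mean(mse)))

# Hypothetical imputation errors; the values are made up.
err_length_m = np.array([0.1, -0.2])  # metres
err_mass_kg = np.array([1.0, -2.0])   # kilograms

in_metres = overall_rmse([err_length_m, err_mass_kg])
in_millimetres = overall_rmse([err_length_m * 1000, err_mass_kg])
print(in_metres, in_millimetres)
```

The imputation quality is identical in both calls; only the unit of the length variable differs, yet the combined value changes by roughly two orders of magnitude.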

You could center and rescale all your variables first, to get RMSE in terms of number of standard deviations from the mean. Personally, I'm not sure if this is such a great idea. I think it depends on what you're using this "overall" measure of fit for, since there's no absolute "good" and "bad" RMSE. If you're going to be comparing different imputation models, it might not be a bad approach. Then again, if you're comparing imputation models for the purpose of fitting a model, you're better off (in my source-less opinion) just fitting the model with each imputation method and comparing the final model fits.
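One way the rescaling idea could look, as a hedged Python sketch: each error is divided by the standard deviation of that variable's observed values, so the result is in standard-deviation units (centering drops out because only differences are used). The function name `rmse_in_sd_units`, the choice of the sample standard deviation and the toy numbers are assumptions, and every column is assumed to have at least one held-out cell.

```python
import numpy as np

def rmse_in_sd_units(truth, imputed, hidden):
    """Per-variable and overall RMSE after scaling errors by each column's SD."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    hidden = np.asarray(hidden, dtype=bool)
    sd = np.nanstd(truth, axis=0, ddof=1)  # SD of the observed values per column
    z_err = (imputed - truth) / sd         # errors in SD units
    per_var = np.array([
        np.sqrt(np.mean(z_err[hidden[:, j], j] ** 2))
        for j in range(truth.shape[1])
    ])
    overall = float(np.sqrt(np.mean(per_var ** 2)))
    return per_var, overall

# Made-up data: column 1 is column 0 on a scale ten times larger.
truth = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
imputed = truth.copy()
imputed[1, 0] = 3.0   # error of 1 on the small scale
imputed[2, 1] = 40.0  # error of 10 on the large scale
hidden = np.zeros(truth.shape, dtype=bool)
hidden[1, 0] = True
hidden[2, 1] = True

per_var, overall = rmse_in_sd_units(truth, imputed, hidden)
print(per_var, overall)
```

In this toy case both columns get the same standardized RMSE, because the larger error sits on a proportionally larger scale.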

The question you linked refers to a different "overall" RMSE. That answer explains how to properly average the RMSEs from a cross-validation procedure on a single variable ($y$ in the answer's notation).
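A short Python sketch of that cross-validation setting, assuming the recipe is the root of the mean fold MSE as described in the question. The residuals are invented, and the two folds are the same size.

```python
import numpy as np

# Made-up residuals of one variable y in two equal-sized folds.
fold_errors = [np.array([1.0, -1.0]), np.array([3.0, -3.0])]

fold_mse = [np.mean(e ** 2) for e in fold_errors]
root_mean_mse = float(np.sqrt(np.mean(fold_mse)))  # root of the mean fold MSE
mean_of_rmse = float(np.mean(np.sqrt(fold_mse)))   # naive average of fold RMSEs
pooled = float(np.sqrt(np.mean(np.concatenate(fold_errors) ** 2)))

print(root_mean_mse, mean_of_rmse, pooled)
```

With equal-sized folds, the root of the mean fold MSE coincides with the RMSE over all pooled residuals, while the plain average of the fold RMSEs does not. All residuals here are in the units of the same variable, which is what makes the pooling meaningful.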

I can't think of any reason to take your simulation's data-generating process into account. The point of a simulation study is to see how your model performs on new data, for which you don't know the underlying data-generating process. Your estimate of model performance should therefore not take into account things you wouldn't plausibly know when fitting your model. I also can't think of how you'd incorporate the missingness stratification if you wanted to, or how you'd interpret the resulting quantity.