Ensemble learning refers to quite a few different methods. Boosting and bagging are probably the two most common ones. It seems that you are attempting to implement an ensemble learning method called stacking. Stacking aims to improve accuracy by combining predictions from several learning algorithms. There are quite a few ways to do stacking and not a lot of rigorous theory. It's intuitive and popular though.
Consider your friend's approach first: the first layer models are fit on four of the five folds, and then the second layer (voting) model is fit on the same four folds. The problem is that the second layer will favor whichever model has the lowest training error, because the same data are used both to fit the models and to devise the procedure for aggregating them. The second layer should combine the models using out-of-sample predictions. Your method is better, but there is a way to do better still.
We'll continue to leave out one fold for testing purposes. Take the four remaining folds and use 4-fold CV to get out-of-sample predictions for each of your first layer models:

1. Leave out one of the four folds, fit the models on the other three, and predict on the held-out fold.
2. Repeat for all four folds, so you have out-of-sample predictions on all four.
3. Fit the second layer model on these out-of-sample predictions.
4. Refit the first layer models on all four folds.
5. Now go to the fifth fold that you haven't touched yet: use the first layer models fit on all four folds, together with the second layer model, to estimate the error on the held-out data.

You can repeat this process with the other folds held out of the first and second layer model fitting.
If you are satisfied with the performance then generate out-of-sample predictions for the first layer models on all five folds and then fit the second layer model on these. Then fit the first layer models one last time on all your data and use these with the second layer model on any new data!
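A minimal sketch of the procedure in code, assuming scikit-learn; the particular models, second layer, and synthetic data here are illustrative choices, not part of the original answer:

```python
# Sketch of stacking with out-of-fold predictions (illustrative models/data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

first_layer = [SVC(probability=True, random_state=0),
               DecisionTreeClassifier(random_state=0)]

def out_of_fold_predictions(models, X, y, n_splits=4):
    """Out-of-sample predictions: each row is predicted by models
    that never saw it during fitting."""
    preds = np.zeros((len(y), len(models)))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        for j, model in enumerate(models):
            model.fit(X[train_idx], y[train_idx])
            preds[test_idx, j] = model.predict_proba(X[test_idx])[:, 1]
    return preds

# Fit the second layer model on out-of-fold predictions only.
Z = out_of_fold_predictions(first_layer, X, y)
second_layer = LogisticRegression().fit(Z, y)

# Refit the first layer models on all the data for use at prediction time.
for model in first_layer:
    model.fit(X, y)
```

At prediction time, new data goes through the refit first layer models, and their predictions are fed into the second layer.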
Finally, some general advice. You'll get more benefit if your first layer models are fairly distinct from each other. You are on the right path here using SVM and decision trees, which are pretty different from each other. Since there is an averaging effect from the second layer model, you may want to try slightly overfitting your first layer models, particularly if you have a lot of them. The second layer is generally something simple, and constraints like non-negativity of weights and monotonicity are common. Finally, remember that stacking relies on cross-validation, which is only an estimate of the true risk. If you get very different error rates and very different model weights across folds, it indicates that your CV-based risk estimate has high variance. In that case, you may want to consider a simple blending (e.g., averaging) of your first layer models. Or, you can compromise by stacking with constraints on the max/min weight placed on each first layer model.
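As a concrete sketch of a constrained second layer: non-negative weights can be obtained with non-negative least squares, e.g. `scipy.optimize.nnls`. The prediction matrix and targets below are made-up numbers for illustration:

```python
# Non-negative second-layer weights via NNLS (made-up example numbers).
import numpy as np
from scipy.optimize import nnls

# Out-of-sample predictions from two hypothetical first-layer models
# (one column per model) and the corresponding true targets.
P = np.array([[0.2, 0.4],
              [0.8, 0.6],
              [0.5, 0.9],
              [0.1, 0.3]])
y = np.array([0.3, 0.7, 0.8, 0.2])

weights, residual = nnls(P, y)  # weights are >= 0 by construction
blended = P @ weights           # constrained second-layer prediction
```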
You can't sum mean squared errors like that unless your variables are all in the same unit, on the same scale. The unit of your RMSE is the square root of the sum of the squared units of its components. This is a completely meaningless unit in most practical applications I can think of.
You could center and rescale all your variables first, to get RMSE in terms of number of standard deviations from the mean. Personally, I'm not sure if this is such a great idea. I think it depends on what you're using this "overall" measure of fit for, since there's no absolute "good" and "bad" RMSE. If you're going to be comparing different imputation models, it might not be a bad approach. Then again, if you're comparing imputation models for the purpose of fitting a model, you're better off (in my source-less opinion) just fitting the model with each imputation method and comparing the final model fits.
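If you do go the centre-and-rescale route, a minimal NumPy sketch (with made-up data, two variables on very different scales) looks like this:

```python
# Standardise each variable before computing errors, so the overall
# RMSE is in units of standard deviations (made-up data for illustration).
import numpy as np

rng = np.random.default_rng(0)
# Two variables on very different scales: N(0, 1) and N(100, 50).
y_true = rng.normal(loc=[0.0, 100.0], scale=[1.0, 50.0], size=(200, 2))
# Predictions with error proportional to each variable's scale.
y_pred = y_true + rng.normal(loc=0.0, scale=[0.5, 25.0], size=(200, 2))

# Standardise both using the observed variables' mean and SD.
mu, sd = y_true.mean(axis=0), y_true.std(axis=0)
z_true = (y_true - mu) / sd
z_pred = (y_pred - mu) / sd

# Overall RMSE in SD units; without standardising, the second
# variable's scale would dominate the sum of squared errors.
rmse = np.sqrt(np.mean((z_true - z_pred) ** 2))
```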
The question you linked refers to a "different" overall RMSE. That answer is explaining how to properly average the RMSEs from a cross-validation procedure, on a single variable ($y$ in the answer's notation).
I can't think of any reason to take your simulation's data-generating process into account. The point of simulation studies is to see how your model performs on new data, where you don't know the underlying data-generating process. Therefore your estimate of model performance should not take into account things you wouldn't plausibly know when fitting your model. I also can't think of how you would incorporate the missingness stratification if you wanted to, or how you would interpret the resulting quantity.
To be correct, you should calculate the overall RMSE as $\sqrt{\frac{RMSE_1^2 + \dots + RMSE_k^2}{k}}$.
Edit: I gather from your question that it may be necessary to explain my answer a bit. The $RMSE_j$ of the instance $j$ of the cross-validation is calculated as $\sqrt{\frac{\sum_i{(y_{ij} - \hat{y}_{ij})^2}}{N_j}}$ where $\hat{y}_{ij}$ is the estimation of $y_{ij}$ and $N_j$ is the number of observations of CV instance $j$. Now the overall RMSE is something like $\sqrt{\frac{\sum_j{\frac{\sum_i{(y_{ij} - \hat{y}_{ij})^2}}{N_j}}}{k}}$ and not what you propose, $\frac{\sum_j{\sqrt{\frac{\sum_i{(y_{ij} - \hat{y}_{ij})^2}}{N_j}}}}{\sum_j{N_j}}$.
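A quick numerical check of the distinction (assuming, for simplicity, equal fold sizes $N_j$): square the per-fold RMSEs, average, then take the root; simply averaging the RMSEs gives a different number.

```python
# Pooling per-fold RMSEs: square, average, then root (not average the roots).
import numpy as np

fold_rmses = np.array([1.0, 2.0, 3.0])       # RMSE_1, ..., RMSE_k
overall = np.sqrt(np.mean(fold_rmses ** 2))  # sqrt((1 + 4 + 9) / 3)
naive = fold_rmses.mean()                    # plain average of the RMSEs
```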