Data Imputation – How to Use Restricted Cubic Splines with the R Mice Imputation Package

data-imputation, multiple-imputation, splines

I am wondering how to use restricted cubic splines (such as those in the rms package) in the imputation models of the R mice package.

Context: I am doing biomedical research and have access to a dataset of patient characteristics and data on each patient's disease progression, as well as outcomes after medical care (e.g. one-year survival). The goal is to build a prediction model based on the patient characteristics and disease progression in order to predict the occurrence of certain outcomes.
Alas, some patients do not have complete information on all the variables. I have therefore decided to use multiple imputation to estimate (multiple times) what these missing values would be.

Problem: When using multiple imputation there's this 'rule' called congeniality. It means that the imputation model should contain the statistical model used for the final analysis (i.e. the prediction model I want to study), preferably with additional information added to it. This also means taking possible non-linear associations into account. As I do not know whether certain predictors have non-linear associations with others, I'd like the imputation models to be able to fit restricted cubic splines (rcs). However, I do not really grasp how to do this in mice. I would therefore like help in creating imputation models that allow for rcs and are suitable for mice.


On a side note to any moderators: I thought this question was suited to Cross Validated, as imputation and splines are specifically 'statistical' subjects. However, given the programming focus of this 'how to' question, I wouldn't mind it being migrated if you think it is more suitable elsewhere.
Because of this doubt, I also posted the question on StackOverflow (https://stackoverflow.com/questions/45674088/how-to-use-restricted-cubic-splines-with-the-r-mice-imputation-package).

Best Answer

You are right that the imputation model needs to be as rich as or richer than the outcome model. The fact that imputation based on full maximum likelihood estimation and imputation done by mice assume linearity everywhere was a prime reason I wrote the aregImpute function in the R Hmisc package, which creates imputation models automatically using rich additive restricted cubic spline models, so linearity is not assumed in the imputation models. The default approach in aregImpute is predictive mean matching (PMM), which I generally prefer over more parametric approaches (splines are still used; PMM is less parametric on the left-hand side of models).

Like mice, aregImpute uses chained equations. Unlike mice, it uses bootstrap draws instead of approximate (assuming multivariate normality) Bayesian posterior draws.
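As a rough illustration of the workflow described above, here is a minimal sketch on simulated data. The variable names (age, sbp, died), the simulated relationships, and the choice of 4 knots and 5 imputations are all assumptions made for the example, not part of the original question. It uses aregImpute for the spline-based imputation and fit.mult.impute to fit an rms logistic model with rcs terms on each completed dataset and pool the results.

```r
library(Hmisc)  # aregImpute, fit.mult.impute
library(rms)    # lrm, rcs

set.seed(1)
n <- 300

# Hypothetical data: sbp depends non-linearly on age,
# and the binary outcome depends non-linearly on sbp.
age  <- rnorm(n, 60, 10)
sbp  <- 120 + 0.02 * (age - 60)^2 + rnorm(n, 0, 10)
died <- rbinom(n, 1, plogis(-1 + 0.03 * (age - 60) + 0.0008 * (sbp - 125)^2))
d <- data.frame(age, sbp, died)

# Introduce missing values in two predictors.
d$sbp[sample(n, 60)] <- NA
d$age[sample(n, 30)] <- NA

# Imputation models: additive restricted cubic splines (nk knots) for the
# continuous variables, predictive mean matching, bootstrap-based draws.
# The outcome is included in the imputation formula.
imp <- aregImpute(~ age + sbp + died, data = d,
                  n.impute = 5, nk = 4, type = "pmm", pr = FALSE)

# Outcome model: fitted on each completed dataset, then pooled.
fit <- fit.mult.impute(died ~ rcs(age, 4) + rcs(sbp, 4),
                       lrm, imp, data = d, pr = FALSE)
print(fit)
```

Here nk controls the number of knots used in the imputation models and n.impute the number of completed datasets; both would normally be chosen with the sample size and the fraction of missing data in mind.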
