Solved – Imputation of a censored variable

censoring, data-imputation, epidemiology, missing-data, r

I have a medical dataset with approximately 200 variables. One of the variables is a biomarker (the concentration of a particular enzyme). Its distribution is right-skewed, and the problem is that values above a certain level are censored at that level: while the mean of the variable is around 10, any value greater than 50 is recorded as 50.

I would like to impute continuous values for those censored values. I am currently using multiple imputation with the mice package in R, though other systems are available to me and I am open to other approaches. One thought was to recode all the censored values as missing and then run the imputations. Any imputed value for an originally censored observation that falls below the cut-off would then be set to the cut-off value.

I'd like to know opinions about this, and/or any better methods of dealing with this.

Best Answer

Any method of imputation, including multiple imputation, is a shot in the dark if you can't take account of how the data above 50 are distributed. Since you have 200 variables, are any of them correlated with the biomarker? If you can fit a regression for the biomarker as a function of the covariates, you could use that model to predict the values for the censored observations. Adding an error term to each prediction, drawn according to the model's residual variance, would let you generate multiple imputations, which would be more sensible. Of course, this assumes you can find a valid model and that the residuals have zero mean and constant variance. You would fit the model using only the non-censored biomarker values.
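As a rough illustration of the regression idea, here is a minimal sketch in plain Python (not mice, and not a definitive implementation). It assumes a single covariate `x`, normally distributed residuals, and synthetic data invented purely for the example. The function names `fit_simple_ols` and `impute_censored` are hypothetical. The model is fitted on the non-censored rows only, each censored value is replaced by a prediction plus a random residual, and any draw that lands below the cut-off is set to the cut-off, as proposed in the question.

```python
import random
import statistics

CUTOFF = 50.0  # censoring limit taken from the question


def fit_simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    resid_sd = (rss / (len(x) - 2)) ** 0.5  # two estimated coefficients
    return a, b, resid_sd


def impute_censored(x, y, cutoff=CUTOFF, m=5, seed=0):
    """Return m completed copies of y with the censored entries replaced."""
    rng = random.Random(seed)
    # Fit on the non-censored observations only.
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi < cutoff]
    a, b, sd = fit_simple_ols([p[0] for p in observed],
                              [p[1] for p in observed])
    completed = []
    for _ in range(m):
        y_new = []
        for xi, yi in zip(x, y):
            if yi < cutoff:
                y_new.append(yi)  # observed value, kept as recorded
            else:
                # Prediction plus a residual drawn from N(0, sd^2).
                draw = a + b * xi + rng.gauss(0.0, sd)
                # The true value is known to be at least the cut-off.
                y_new.append(max(draw, cutoff))
        completed.append(y_new)
    return completed


# Synthetic data for illustration only (assumed, not from the question).
data_rng = random.Random(1)
x = [data_rng.uniform(0, 10) for _ in range(200)]
true_y = [5 + 6 * xi + data_rng.gauss(0, 5) for xi in x]
y = [min(v, CUTOFF) for v in true_y]  # values above 50 recorded as 50

imputations = impute_censored(x, y, m=5)
```

Each element of `imputations` is one completed dataset, so the usual multiple-imputation step of analysing each copy and pooling the results would follow. Note that this sketch treats the fitted coefficients as fixed across imputations and inherits the assumptions above about the model and its residuals.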
