Imputation – The Advantage of Imputation Over Building Multiple Regression Models

I wonder if someone could provide some insight into whether, and why, imputation for missing data is better than simply building separate models for the cases with missing data, especially in the case of [generalized] linear models (I can perhaps see that things are different in non-linear cases).

Suppose we have the basic linear model:

$ Y = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon$

But our data set contains some records with $X_3$ missing. In the prediction data set where the model will be used there will also be cases of missing $X_3$. There seem to be two ways to proceed:

Multiple models

We could split the data into cases with $X_3$ observed and cases with $X_3$ missing, and build a separate model for each. If we suppose that $X_3$ is closely related to $X_2$, then the missing-data model can overweight $X_2$ to get the best two-predictor prediction. Also, if the missing-data cases are slightly different (due to the missing data mechanism), then the model can incorporate that difference. On the downside, each model is fit on only a portion of the data, and the two aren't "helping" each other out, so the fit might be poor on limited datasets.
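As a rough illustration of the two-model approach, here is a minimal NumPy sketch. The synthetic data, the coefficient values, the 30% missingness rate, and the helper names `ols`, `fit_split_models`, and `predict_split` are all assumptions made for the example; missing $X_3$ is encoded as NaN, and there is no intercept, matching the model above.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (no intercept, as in the model above)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_split_models(X, y):
    """Fit one model on rows with X3 observed and one on rows with X3 missing (NaN)."""
    miss = np.isnan(X[:, 2])
    beta_full = ols(X[~miss], y[~miss])        # uses X1, X2, X3
    beta_miss = ols(X[miss][:, :2], y[miss])   # uses X1, X2 only
    return beta_full, beta_miss

def predict_split(X, beta_full, beta_miss):
    """Route each row to the model that matches its missingness pattern."""
    miss = np.isnan(X[:, 2])
    pred = np.empty(len(X))
    pred[~miss] = X[~miss] @ beta_full
    pred[miss] = X[miss][:, :2] @ beta_miss
    return pred

# Synthetic data (assumed for illustration): X3 correlated with X2,
# and X3 deleted completely at random in about 30% of rows.
rng = np.random.default_rng(0)
n = 2000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
x3 = 0.8 * x2 + 0.6 * rng.normal(size=n)
y = x1 + x2 + x3 + 0.5 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
X[rng.random(n) < 0.3, 2] = np.nan

beta_full, beta_miss = fit_split_models(X, y)
pred = predict_split(X, beta_full, beta_miss)
print("model with X3:   ", beta_full.round(2))
print("model without X3:", beta_miss.round(2))
```

In this setup the coefficient on $X_2$ in the model without $X_3$ should come out larger than in the model with $X_3$, which is the "overweighting" described above. Note that each fit only sees its own subset of rows.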

Imputation

Regression multiple imputation would first fill in $X_3$ by building a model based on $X_1$ and $X_2$ and then randomly sampling to maintain the noise in the imputed data. Since this is again two models, will this not just end up being the same as the multiple model method above? If it is able to outperform – where does the gain come from? Is it just that the fit for $X_1$ is done on the entire set?
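The imputation step described here can be sketched as follows. This is a simplified single draw, assuming a linear model for $X_3$ given $X_1$ and $X_2$ with normal residuals; the data and the function name `impute_x3` are made up for illustration. A full multiple-imputation procedure would repeat the draw several times and would also propagate the uncertainty in the imputation model's own parameters, which this sketch does not do.

```python
import numpy as np

def impute_x3(X, rng, noise=True):
    """Fill missing X3 (NaN) from a linear model on X1 and X2 fit to complete rows."""
    miss = np.isnan(X[:, 2])
    A_obs, x3_obs = X[~miss][:, :2], X[~miss, 2]
    gamma = np.linalg.lstsq(A_obs, x3_obs, rcond=None)[0]
    resid = x3_obs - A_obs @ gamma
    sd = np.sqrt(resid @ resid / (len(x3_obs) - 2))  # residual standard deviation
    X_imp = X.copy()
    X_imp[miss, 2] = X[miss][:, :2] @ gamma          # regression prediction
    if noise:
        # Add random residual noise so imputed values keep a realistic spread
        X_imp[miss, 2] += rng.normal(scale=sd, size=miss.sum())
    return X_imp

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(1)
n = 2000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
x3 = 0.8 * x2 + 0.6 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
X[rng.random(n) < 0.3, 2] = np.nan

miss = np.isnan(X[:, 2])
X_det = impute_x3(X, rng, noise=False)
X_sto = impute_x3(X, rng, noise=True)
print("variance of observed X3:          ", X[~miss, 2].var())
print("variance of deterministic imputes:", X_det[miss, 2].var())
print("variance of stochastic imputes:   ", X_sto[miss, 2].var())
```

Comparing the three variances shows why the random sampling matters: imputing the regression prediction alone understates the spread of $X_3$, while adding residual noise brings it closer to that of the observed values.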

EDIT:

While Steffan's answer so far explains that fitting the complete case model on imputed data will outperform fitting on complete data, and it seems obvious the reverse is true, there is still some misunderstanding about the missing data forecasting.

If I have the above model, even fitted perfectly, it will in general be a terrible forecasting model if I just plug in zero for a missing $X_3$ when predicting. Imagine, for example, that $X_2 = X_3+\eta$; then $X_2$ is completely useless ($\beta_2 = 0$) when $X_3$ is present, but would still be useful in the absence of $X_3$.
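This example can be checked numerically. The sketch below assumes specific values that are not in the question (true coefficients $\beta = (1, 0, 2)$, noise scales 0.3 and 0.5, and fully observed simulated data) and compares in-sample errors only.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (no intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x3 = rng.normal(size=n)
x2 = x3 + 0.3 * rng.normal(size=n)                  # X2 = X3 + eta
y = 1.0 * x1 + 2.0 * x3 + 0.5 * rng.normal(size=n)  # true beta2 is 0

full = ols(np.column_stack([x1, x2, x3]), y)   # model with X3
reduced = ols(np.column_stack([x1, x2]), y)    # model without X3

# Predict as if X3 were missing: plug in zero vs. use the two-predictor model
pred_zero = np.column_stack([x1, x2, np.zeros(n)]) @ full
pred_reduced = np.column_stack([x1, x2]) @ reduced
mse_zero = np.mean((y - pred_zero) ** 2)
mse_reduced = np.mean((y - pred_reduced) ** 2)

print("full model coefficients:   ", full.round(2))
print("reduced model coefficients:", reduced.round(2))
print("MSE, zero plugged in for X3:", round(mse_zero, 2))
print("MSE, two-predictor model:   ", round(mse_reduced, 2))
```

The full model should give $X_2$ a coefficient near zero, while the two-predictor model gives it a large one, so plugging zero into the full model throws away the information that $X_2$ carries about $X_3$.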

The key question I don't understand is: is it better to build two models, one using $(X_1, X_2)$ and one using $(X_1, X_2, X_3)$, or is it better to build a single (full) model and use imputation on the forecast datasets – or are these the same thing?

Bringing in Steffan's answer, it would appear that it is better to build the complete case model on an imputed training set, and conversely it is probably best to build the missing data model on the full data set with $X_3$ discarded. Is this second step any different from using an imputation model in the forecasting data?
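One way to frame the comparison is to write out what the two-predictor model is estimating. This is a sketch under assumptions not stated above: the linear model holds with $E[\epsilon \mid X_1, X_2, X_3] = 0$, and the rows with missing $X_3$ follow the same relationships as the complete rows. Taking the conditional expectation of the model given only $X_1$ and $X_2$:

$$E[Y \mid X_1, X_2] = \beta_1 X_1 + \beta_2 X_2 + \beta_3 \, E[X_3 \mid X_1, X_2]$$

If, in addition, the imputation model is linear, $E[X_3 \mid X_1, X_2] = \gamma_1 X_1 + \gamma_2 X_2$ (the $\gamma$ symbols are introduced here for illustration), then:

$$E[Y \mid X_1, X_2] = (\beta_1 + \beta_3 \gamma_1) X_1 + (\beta_2 + \beta_3 \gamma_2) X_2$$

Under these assumptions, the two-predictor model and the full model averaged over imputed values of $X_3$ target the same conditional mean. Any difference in practice would then come from how much data each fit uses and from how far the missing-data mechanism departs from the assumptions.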

Best Answer

I think the key here is understanding the missing data mechanism, or at least ruling some mechanisms out. Building separate models is akin to treating the missing and non-missing groups as random samples. If missingness on $X_3$ is related to $X_1$ or $X_2$ or some other unobserved variable, then your estimates will likely be biased in each model. Why not use multiple imputation on the development data set and use the combined coefficients on a multiply imputed prediction set? Average across the predictions and you should be good.
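The suggestion above could be sketched roughly as follows. This is a simplified illustration, not a full multiple-imputation implementation: the helper names (`draw_x3`, `mi_predict`, `make_data`), the synthetic data, and the choice of `m=20` imputations are assumptions, and parameter uncertainty in the imputation model is ignored. The training imputations condition on $Y$ as well as $X_1$ and $X_2$; the prediction rows have no $Y$, so there the imputation uses $X_1$ and $X_2$ only.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (no intercept)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def draw_x3(Z_obs, x3_obs, Z_new, rng):
    """One stochastic draw of X3 for rows Z_new, from a linear model fit on observed rows."""
    gamma = ols(Z_obs, x3_obs)
    resid = x3_obs - Z_obs @ gamma
    sd = np.sqrt(resid @ resid / (len(x3_obs) - Z_obs.shape[1]))
    return Z_new @ gamma + rng.normal(scale=sd, size=len(Z_new))

def mi_predict(X_train, y_train, X_new, m=20, seed=0):
    """Average coefficients over m imputed training sets, then average
    predictions over m imputed copies of the prediction set."""
    rng = np.random.default_rng(seed)
    tr_miss = np.isnan(X_train[:, 2])
    new_miss = np.isnan(X_new[:, 2])
    Z_tr = np.column_stack([X_train[:, :2], y_train])  # X1, X2, Y
    betas = []
    for _ in range(m):
        Xt = X_train.copy()
        Xt[tr_miss, 2] = draw_x3(Z_tr[~tr_miss], X_train[~tr_miss, 2],
                                 Z_tr[tr_miss], rng)
        betas.append(ols(Xt, y_train))
    beta = np.mean(betas, axis=0)                      # combined coefficients
    preds = []
    for _ in range(m):
        Xn = X_new.copy()
        Xn[new_miss, 2] = draw_x3(X_train[~tr_miss][:, :2], X_train[~tr_miss, 2],
                                  X_new[new_miss][:, :2], rng)
        preds.append(Xn @ beta)
    return beta, np.mean(preds, axis=0)                # averaged predictions

def make_data(n, rng):
    """Synthetic data (assumed): true coefficients (1, 1, 1), X3 missing in ~30% of rows."""
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    x3 = 0.8 * x2 + 0.6 * rng.normal(size=n)
    y = x1 + x2 + x3 + 0.5 * rng.normal(size=n)
    X = np.column_stack([x1, x2, x3])
    X[rng.random(n) < 0.3, 2] = np.nan
    return X, y

rng = np.random.default_rng(42)
X_train, y_train = make_data(2000, rng)
X_new, _ = make_data(500, rng)
beta, pred = mi_predict(X_train, y_train, X_new)
print("combined coefficients:", beta.round(2))
```

In this simulation the missingness is completely at random, so it does not exercise the bias concern raised above; it only shows the mechanics of combining coefficients and averaging predictions.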
