Solved – Multiple regression with missing predictor variable

missing data, multicollinearity, multiple regression, multiple-imputation, regression

Suppose we are given data points of two forms, $(y,x_{1},x_{2},\cdots, x_{n})$ and $(y,x_{1},x_{2},\cdots, x_{n-1})$, and our task is to predict $y$ from the values of the $x$'s. We estimate two regressions:
$$
\begin{align}
y &=f_{1}(x_{1},\cdots, x_{n-1}, x_{n}) \tag{1} \\
y &=f_{2}(x_{1},\cdots, x_{n-1}) \tag{2}
\end{align}
$$

We also estimate a regression that predicts values of $x_{n}$ based on values of $(x_{1},\cdots, x_{n-1})$, that is:
$$
x_{n}=f_{3}(x_{1},\cdots, x_{n-1}) \tag{3}
$$

Suppose now that we are given only the values of $(x_{1},\cdots, x_{n-1})$; we then have two different ways to predict $y$:

$$
\begin{align}
y&=f_{1}(x_{1},\cdots, x_{n-1},f_{3}(x_{1},\cdots,x_{n-1})) \tag{4} \\
y&=f_{2}(x_{1},\cdots, x_{n-1}) \tag{5}
\end{align}
$$

Which one would be better in general?

I am guessing that the first method would be better because it uses information from both forms of data points, whereas the second uses information only from the data points with $n-1$ predictor values. My training in statistics is limited, so I would like to seek some professional advice.

Also, in general, what is the best approach toward data that have incomplete information? In other words, how can we extract the most information from data that do not have values in all $n$ dimensions?
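
For concreteness, here is a minimal sketch of the two prediction strategies, using ordinary least squares via numpy (the simulated data, sample sizes, and variable names are purely illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the first 200 rows are "complete" cases with x_n observed
# (here n = 3); the remaining 150 rows are missing x_n.
X_all = rng.normal(size=(350, 3))
y_all = X_all @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=350)
X_full, y_full = X_all[:200], y_all[:200]
X_part, y_part = X_all[200:, :2], y_all[200:]

def ols(X, y):
    """Ordinary least squares; returns coefficients with the intercept first."""
    Xc = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return beta

f1 = ols(X_full, y_full)                      # regression (1): y ~ x_1, ..., x_n
f2 = ols(np.vstack([X_full[:, :2], X_part]),  # regression (2): y ~ x_1, ..., x_{n-1},
         np.concatenate([y_full, y_part]))    #   fit to all rows
f3 = ols(X_full[:, :2], X_full[:, 2])         # regression (3): x_n ~ x_1, ..., x_{n-1}

x_new = np.array([0.3, -1.2])                 # a new case with x_n unobserved

# Method (4): impute x_n with regression (3), then apply regression (1)
xn_hat = f3[0] + x_new @ f3[1:]
y_hat_4 = f1[0] + np.append(x_new, xn_hat) @ f1[1:]

# Method (5): apply regression (2) directly
y_hat_5 = f2[0] + x_new @ f2[1:]

print(y_hat_4, y_hat_5)
```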

Best Answer

+1, I think this is a really interesting and clearly stated question. However, more information will help us think through this situation.

For example, what is the relationship between $x_n$ and $y$? It's quite possible that there isn't one, in which case, regression $(1)$ offers no advantage relative to regression $(2)$. (Actually, it is at a very slight disadvantage, in the sense that the standard errors will be slightly larger, and thus betas might be slightly further, on average, from their true values.) If there is a function mapping $x_n$ to $y$, then, by definition, there is real information there, and regression $(1)$ will be better in the initial situation.
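
To see the parenthetical point in a small simulation (the setup is hypothetical: $x_n$ is correlated with $x_1$ but has no relationship with $y$), including the irrelevant predictor inflates the standard error of the $x_1$ coefficient:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
xn = 0.7 * x1 + rng.normal(size=n)   # correlated with x1, but unrelated to y
y = 2.0 * x1 + rng.normal(size=n)    # y depends on x1 only

# Standard error of the x1 coefficient with and without the irrelevant x_n
fit_with    = sm.OLS(y, sm.add_constant(np.column_stack([x1, xn]))).fit()
fit_without = sm.OLS(y, sm.add_constant(x1)).fit()
print(fit_with.bse[1], fit_without.bse[1])   # the first is larger
```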

Next, what is the nature of the relationship between $(x_1, \cdots, x_{n-1})$ and $x_n$? Is there one? For instance, when we conduct experiments, we (usually) try to assign equal numbers of study units to each combination of values of the explanatory variables. (This approach uses a multiple of the Cartesian product of the levels of the independent variables and is called a 'full factorial' design; there are also 'fractional factorial' designs, in which some levels are intentionally confounded to reduce the number of runs.) If the explanatory variables are orthogonal, the coefficients in your third regression will be exactly $0$. On the other hand, in an observational study the covariates are pretty much always correlated. The stronger that correlation, the less new information exists in $x_n$. These facts will modulate the relative merits of regression $(1)$ and regression $(2)$.
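
As a quick illustration of the orthogonal case (using a hypothetical $2^3$ full factorial design coded as $\pm 1$), regressing $x_3$ on $x_1$ and $x_2$ gives coefficients that are exactly zero:

```python
import numpy as np
from itertools import product

# All 8 runs of a 2^3 full factorial design, coded -1 / +1
design = np.array(list(product([-1.0, 1.0], repeat=3)))
x12, x3 = design[:, :2], design[:, 2]

Xc = np.column_stack([np.ones(len(x12)), x12])
coefs, *_ = np.linalg.lstsq(Xc, x3, rcond=None)
print(coefs)   # ~[0, 0, 0]: the other predictors tell you nothing about x_3
```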

However, (unfortunately perhaps) it's more complicated than that. One of the important, but difficult, concepts in multiple regression is multicollinearity. Should you attempt to estimate regression $(4)$, you will find that you have perfect multicollinearity, and your software will tell you that the design matrix is not invertible. Thus, while regression $(1)$ may well offer an advantage relative to regression $(2)$, regression $(4)$ will not.
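
Here is a small sketch of what goes wrong (simulated data, purely illustrative): appending the fitted values from regression $(3)$ as an extra column leaves the design matrix rank deficient, which is exactly the perfect multicollinearity your software complains about.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])   # 1, x_1, x_2
xn = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=100)       # the true x_n

g, *_ = np.linalg.lstsq(X, xn, rcond=None)   # regression (3)
xn_hat = X @ g                               # fitted values of x_n

design_4 = np.column_stack([X, xn_hat])      # design matrix for "regression (4)"
# 4 columns but still rank 3: xn_hat is an exact linear combination of the others
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(design_4))
```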

The more interesting question (and the one you're asking) is: what if you use regression $(1)$ to make predictions about $y$, with the estimated $x_n$ values output by the prediction equation from regression $(3)$? (That is, you're not estimating regression $(4)$; you're plugging the output of the prediction equation estimated in regression $(3)$ into prediction model $(4)$.) The thing is that you aren't actually gaining any new information this way: because the imputed $\hat{x}_n$ is itself just a function of $(x_1, \cdots, x_{n-1})$, the plug-in prediction is still only a function of those same $n-1$ predictors. Whatever information exists in the first $n-1$ predictor values for each observation is already being used optimally by regression $(2)$, so there is no gain.
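
In fact, when all three regressions are ordinary least squares fit to the same complete cases (a simplifying assumption; the simulated data below are purely illustrative), the plug-in predictions coincide exactly with the predictions from regression $(2)$:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))                       # x_1, x_2, x_3 (= x_n)
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=300)

def ols(X, y):
    Xc = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return beta

f1 = ols(X, y)                # regression (1): y ~ x_1, x_2, x_n
f2 = ols(X[:, :2], y)         # regression (2): y ~ x_1, x_2
f3 = ols(X[:, :2], X[:, 2])   # regression (3): x_n ~ x_1, x_2

x_new = rng.normal(size=(5, 2))                     # new cases with x_n missing
xn_hat = f3[0] + x_new @ f3[1:]
pred_plug_in = f1[0] + np.column_stack([x_new, xn_hat]) @ f1[1:]
pred_reduced = f2[0] + x_new @ f2[1:]
print(np.allclose(pred_plug_in, pred_reduced))      # True: no new information
```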

Thus, the answer to your first question is that you might as well go with regression $(2)$ for your predictions to save unnecessary work. Note that I have been addressing this in a fairly abstract way, rather than addressing the concrete situation you describe in which someone hands you two data sets (I just can't imagine this occurring). Instead, I'm thinking of this question as trying to understand something fairly deep about the nature of regression. What does occur on occasion, though, is that some observations have values on all predictors, and some other observations (within the same dataset) are missing some values on some of the predictors. This is particularly common when dealing with longitudinal data. In such a situation, you want to investigate multiple imputation.
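
If you do end up in that situation, a rough sketch of the multiple-imputation workflow might look like the following (scikit-learn's IterativeImputer is used here as a stand-in; dedicated implementations such as `mice` in R carry out the full procedure, including Rubin's rules for pooling):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=300)
X_miss = X.copy()
X_miss[rng.random(300) < 0.3, 2] = np.nan     # x_n missing for roughly 30% of rows

# Create several imputed data sets, fit the regression to each, then pool
coef_draws = []
for m in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    X_completed = imputer.fit_transform(X_miss)
    coef_draws.append(LinearRegression().fit(X_completed, y).coef_)

print(np.mean(coef_draws, axis=0))            # pooled coefficient estimates
```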
