Solved – How to predict an outcome with missing variable

missing data, prediction, regression

Suppose I have two datasets:
One is the training dataset, which has dependent variable $Y$ and independent variables $X_1$, $X_2$, and $X_3$;

Now I have another dataset, which contains only $X_1$ and $X_2$; neither $X_3$ nor $Y$ is observed.
My question is: can we still predict $Y$ from the second dataset, using the regression results from dataset one?

Best Answer

The answer is no. The fitted model includes a term for X3, and if your newdata does not contain that column, predict() will throw an error.

See the following example. I copy mtcars into a data frame df and remove the Y column (mpg) and one of the X columns (cyl). I then fit a simple lm model on mtcars and predict on df.

df <- mtcars
df$mpg <- NULL # remove Y
df$cyl <- NULL # remove one X column

lm_model <- lm(mpg ~ ., data = mtcars)
predict(lm_model, df)

Error in eval(expr, envir, enclos) : object 'cyl' not found

What you could do is impute the missing column in your newdata, for example with 0, the mean, or the median, or by using an imputation function. But you should realize that the model's predictions will then be off to some degree.
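As a minimal sketch of the simplest option, the snippet below fills the missing cyl column with its training mean and compares the resulting predictions with those obtained from the true cyl values. The names train, newdata, and newdata_imputed are illustrative assumptions, and the comparison is only possible here because mtcars actually contains cyl.

```r
# Training data keeps every column; the new data lacks mpg (Y) and cyl (one X).
train <- mtcars
newdata <- mtcars[, setdiff(names(mtcars), c("mpg", "cyl"))]

lm_model <- lm(mpg ~ ., data = train)

# Assumption for illustration: fill the missing predictor with one constant,
# the training mean of cyl.
newdata_imputed <- newdata
newdata_imputed$cyl <- mean(train$cyl)

pred_imputed <- predict(lm_model, newdata = newdata_imputed)
pred_true <- predict(lm_model, newdata = train)

# How far the imputed predictions drift from those made with the real cyl values
summary(abs(pred_imputed - pred_true))
```

The call to predict() no longer fails, but every row is scored as if it had an average cyl, so the size of the error depends on how much weight the model puts on that column.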

Edit based on comment:

Of course, you can skip the X3 column and just regress on X1 and X2. Imputing the value is always a judgment call based on experience: the median if there are outliers, the mean if the data are normally distributed, or model-based imputation if you can deduce the value of X3 from the values of X1 and X2. For example, you could look into the package mice and its vignette. There are more packages for imputing missing data, but the right choice partly depends on why the data are missing in the first place. There is a whole body of work on how to deal with missing data.
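The two routes mentioned here, refitting on the available predictors and deducing X3 from X1 and X2, can be sketched as follows. This is only an illustration: wt and hp stand in for X1 and X2, cyl stands in for the missing X3, and the object names are assumptions.

```r
train <- mtcars

# Option 1: refit using only the predictors available in the new data
# (wt and hp play the role of X1 and X2).
reduced_model <- lm(mpg ~ wt + hp, data = train)

# Option 2: model X3 (here cyl) from X1 and X2, then plug the predicted X3
# into the full model.
full_model <- lm(mpg ~ wt + hp + cyl, data = train)
x3_model <- lm(cyl ~ wt + hp, data = train)

newdata <- train[, c("wt", "hp")]  # pretend cyl and mpg were never observed
newdata_filled <- newdata
newdata_filled$cyl <- predict(x3_model, newdata = newdata)

pred_reduced <- predict(reduced_model, newdata = newdata)
pred_filled <- predict(full_model, newdata = newdata_filled)

# Compare the two sets of predictions
summary(pred_reduced - pred_filled)
```

For ordinary least squares fitted on the same training rows, the two sets of predictions coincide up to floating-point error, so deterministic regression imputation gains nothing over simply refitting without X3. Imputation becomes more interesting when the imputation model uses extra information or when uncertainty is propagated, as in multiple imputation.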

Chapter 25 of Andrew Gelman's book Data Analysis Using Regression and Multilevel/Hierarchical Models deals with missing data.

There are also lists of software packages dealing with missing data. The author of mice also wrote an article on the package in the Journal of Statistical Software.

Related Solutions

Imputation – The Advantage of Imputation Over Building Multiple Regression Models

I think the key here is understanding the missing data mechanism, or at least ruling some mechanisms out. Building separate models is akin to treating the missing and non-missing groups as random samples. If missingness on X3 is related to X1 or X2 or some other unobserved variable, then your estimates will likely be biased in each model. Why not use multiple imputation on the development data set and use the combined coefficients on a multiply imputed prediction set? Average across the predictions and you should be good.

Regression – How to Extract Dependence on a Single Variable When Independent Variables Are Correlated

Aksakal's answer is correct. By controlling for all variables in a regression, you "keep them constant" and you are able to identify the partial correlation between your regressor of interest and the outcome. Let me give you an example to make this clearer.

First, let us create some correlated $X$s.

 ex <- rnorm(1000)
 x1 <- 5*ex + rnorm(1000)
 x2 <- -3*ex + rnorm(1000)
 x3 <- 4*ex + rnorm(1000)

Now, since all these variables are generated by some underlying variable $ex$, they are clearly correlated. You can check this using cor(x1,x2), for instance.

Now, let us generate the dependent variable with known parameters.

 y <- 1*x1 + 2*x2 + 3*x3 + rnorm(1000)

Here we know that $\beta_1=1, \beta_2=2, \beta_3=3$. I have picked them arbitrarily. Let us now see if Aksakal's approach can uncover these parameters:

 lm(y ~ x1+x2+x3)

If it works, the estimated parameters should be close to the ones we have picked. Here is the result:

 Call:
 lm(formula = y ~ x1 + x2 + x3)

 Coefficients:
 (Intercept)           x1           x2           x3  
    -0.01224      0.99805      1.99746      2.99670

As you can see, all parameters have been uncovered.
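To see why controlling for every correlated regressor matters, one can rerun the same simulation with x3 left out. The sketch below repeats the data-generating steps above; the seed and the names full_fit and reduced_fit are arbitrary assumptions added for reproducibility and illustration.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible
n <- 1000

# Same data-generating process as in the example above
ex <- rnorm(n)
x1 <- 5 * ex + rnorm(n)
x2 <- -3 * ex + rnorm(n)
x3 <- 4 * ex + rnorm(n)
y <- 1 * x1 + 2 * x2 + 3 * x3 + rnorm(n)

full_fit <- lm(y ~ x1 + x2 + x3)  # all correlated regressors included
reduced_fit <- lm(y ~ x1 + x2)    # x3 left out

coef(full_fit)
coef(reduced_fit)
```

In the reduced fit the coefficient on x1 moves well away from its true value of 1, because x1 now also acts as a proxy for the omitted x3. The reduced model can still be useful for prediction, but its coefficients no longer estimate the original parameters.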

Having said that, there are many caveats involved here as well. Most importantly, you should not interpret these coefficients in a causal way. Depending on your actual situation, it might help if you explain a bit more what you are trying to estimate so that people can evaluate whether this method is appropriate (or whether answering your research question is feasible at all). For instance, why do you think your independent variables are correlated? Is it that $X_1$ might have an effect on $X_2$ and this has an effect on $y$? If this is the setup you have in mind, then depending on your field, you may want to look into mediator/moderator analysis or into quasi-experimental methods. Hence you see you might benefit from elaborating a bit more on your situation.
