Solved – How to predict an outcome with a missing variable

missing data, prediction, regression

Suppose I have two datasets:
The first is the training dataset, which has the dependent variable $Y$ and the independent variables $X_1$, $X_2$, and $X_3$.

The second dataset contains only $X_1$ and $X_2$; neither $X_3$ nor $Y$ is observed.
My question is: can we still predict $Y$ from the second dataset, using the regression results from dataset one?

Best Answer

The answer is no. The fitted model includes a term for X3, so if your newdata does not contain that column, predict() will throw an error.

See the following example. I copy mtcars into a data frame df and remove the Y column (mpg) and one of the X columns (cyl). I then fit a simple lm model on mtcars and predict on df.

df <- mtcars
df$mpg <- NULL # remove Y
df$cyl <- NULL # remove one X variable

lm_model <- lm(mpg ~ ., data = mtcars)
predict(lm_model, df)
Error in eval(expr, envir, enclos) : object 'cyl' not found
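
Before calling predict(), it can help to check which predictors the model expects but the new data lacks. The following is a minimal sketch using the same mtcars setup; the names needed and missing_cols are introduced here for illustration only.

```r
# Same setup as above: drop the outcome and one predictor from the new data
df <- mtcars
df$mpg <- NULL
df$cyl <- NULL

lm_model <- lm(mpg ~ ., data = mtcars)

# Predictor names the fitted model expects (response removed from the terms)
needed <- all.vars(delete.response(terms(lm_model)))

# Predictors that are absent from the new data
missing_cols <- setdiff(needed, names(df))
missing_cols
```

If missing_cols is non-empty, predict() will fail until those columns are supplied or the model is refitted without them.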

What you could do is impute the missing column in your newdata, for example with 0, the mean or the median of that variable, or with an imputation function. Be aware, though, that the model's predictions will then be less accurate.
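
As a rough illustration of the imputation idea, the sketch below fills the missing cyl column with its training mean and compares the resulting predictions with those obtained from the true cyl values. Using the training mean is only one of the options mentioned above, and the names newdata, pred_imputed and pred_full are chosen here for the example.

```r
# Model fitted on the full training data, including cyl
lm_model <- lm(mpg ~ ., data = mtcars)

# New data without the outcome and without cyl
newdata <- mtcars
newdata$mpg <- NULL
newdata$cyl <- NULL

# Assumption for this sketch: replace the missing predictor by its training mean
newdata$cyl <- mean(mtcars$cyl)

# Predictions with the imputed column versus the true column
pred_imputed <- predict(lm_model, newdata)
pred_full <- predict(lm_model, mtcars)

# Average absolute difference caused by the imputation
mean(abs(pred_imputed - pred_full))
```

The difference is nonzero whenever the coefficient on cyl is nonzero and the true values differ from the mean, which is the sense in which the predictions "will be off".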

Edit based on comment:

Of course, you can skip the X3 column and just regress on X1 and X2. How to impute is always a decision based on experience: the median if there are outliers, the mean if the data are normally distributed, or a model-based imputation if you can deduce the value of X3 from the values of X1 and X2. For example, you could look into the package mice and its vignette. There are more packages for imputing missing data, but which approach is appropriate depends in part on why the data are missing in the first place. There is a whole body of work on how to deal with missing data.
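
The two routes mentioned here, refitting without X3 and deducing X3 from X1 and X2, can be sketched in base R as follows. This is an illustration under stated assumptions rather than a recommended workflow: cyl plays the role of X3, the imputation model is a plain linear regression, and all object names are made up for the example.

```r
train <- mtcars

# New data: neither the outcome (mpg) nor cyl is available
newdata <- mtcars[, !(names(mtcars) %in% c("mpg", "cyl"))]

# Route 1: refit the model using only the predictors present in newdata
reduced_model <- lm(mpg ~ ., data = train[, names(train) != "cyl"])
pred_reduced <- predict(reduced_model, newdata)

# Route 2: impute cyl from the other predictors, then use the full model
full_model <- lm(mpg ~ ., data = train)
cyl_model <- lm(cyl ~ ., data = train[, names(train) != "mpg"])
imputed <- newdata
imputed$cyl <- predict(cyl_model, newdata)
pred_regimp <- predict(full_model, imputed)

# Compare the two sets of predictions
all.equal(pred_reduced, pred_regimp)
```

With ordinary least squares and a linear, deterministic imputation model fitted on the same training data, the two routes give the same predictions up to numerical precision, so the imputation adds nothing over the reduced model in that case. Imputation becomes more interesting with non-linear models or with multiple imputation as implemented in packages such as mice.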

Chapter 25 of Andrew Gelman and Jennifer Hill's book Data Analysis Using Regression and Multilevel/Hierarchical Models deals with missing data.

There are also lists of software packages dealing with missing data. The author of mice also wrote an article on this package in the Journal of Statistical Software.