Solved – Missing Values (NAs) in the Test Data When Using predict.lm in R

forecasting, least squares, missing data, r, regression

I have two data sets:

  1. Train data
  2. Test data (with no values for the dependent variable, but with data on the independent variables; in other words, the data I need to forecast).

Using the training data (which has NAs in some cells), I fitted an ordinary least squares (OLS) regression with lm() in R and obtained the $\beta$ coefficients of the model. (All is good so far!)

Now, at the prediction stage, some cells in the test data set have missing values. I used predict() as follows:

 predict(ols, test_data.df, interval = "prediction", na.action = na.pass)

For any row with an NA in one or more cells, the entire row is discarded when generating the output (yhat). Is there a function that can generate yhat values (other than NA) for the test data without discarding rows that have a missing value in some cell?

Best Answer

First, let me preface this by stating that missing data is its own specialty in statistics, so there are many different answers to this question.

As you've discovered, by default, R uses case-wise deletion of missing values. This means that whenever a row of your data contains a missing value (on either side of your regression formula), R simply ignores that row. This isn't great: if you have 100 observations but half of your rows have at least one missing value, you effectively have 50 observations. In some disciplines, the prevalence of missing data can rapidly shrink your data set. When I was an undergraduate, I analyzed a 3,000-person survey which shrank to just 316 people under case-wise deletion!
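To see how many rows case-wise deletion actually costs, one can compare the number of rows supplied with the number lm() used. The following is a minimal sketch on made-up data; the data frame `train` and its columns are assumptions for illustration, not the asker's data.

```r
# Toy training data (hypothetical): 10 rows, with NAs scattered across
# the response and both predictors.
train <- data.frame(
  y  = c(1.2, 2.1, NA, 4.3, 5.1, 5.8, 7.2, 8.1, 8.8, 10.3),
  x1 = c(1, 2, 3, NA, 5, 6, 7, 8, NA, 10),
  x2 = c(2.5, 1.0, 3.5, 2.0, NA, 4.0, 1.5, 3.0, 2.5, 0.5)
)

# na.omit is R's factory default for lm(); it is written out here
# only to make the case-wise deletion explicit.
fit <- lm(y ~ x1 + x2, data = train, na.action = na.omit)

n_total    <- nrow(train)                 # rows supplied
n_complete <- sum(complete.cases(train))  # rows with no NA in any column
n_used     <- nobs(fit)                   # rows lm() actually used

cat("rows supplied:", n_total, "\n")
cat("complete rows:", n_complete, "\n")
cat("rows used by lm():", n_used, "\n")
```

Checking `nobs(fit)` against `nrow()` of the data is a quick way to notice when a model was fitted on far fewer observations than expected.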

But it gets worse than a shrinking sample size: there may be hidden problems, such as an association between the pattern of missingness and the value of the missing element. For example, people with higher incomes are more likely not to disclose their salary. This makes it difficult to draw meaningful, statistically sound conclusions about income.
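A small simulation can make this concrete. The sketch below assumes a purely hypothetical non-disclosure mechanism in which the probability of a missing salary grows with income rank; the distribution and all of its parameters are invented for illustration.

```r
set.seed(1)
n <- 10000

# Simulated incomes (hypothetical log-normal distribution).
income <- rlnorm(n, meanlog = 10.5, sdlog = 0.6)

# Assumed mechanism: the chance of non-disclosure rises with income rank,
# from nearly 0 for the lowest earner to 0.8 for the highest.
p_missing <- 0.8 * rank(income) / n

observed <- income
observed[runif(n) < p_missing] <- NA

true_mean     <- mean(income)
observed_mean <- mean(observed, na.rm = TRUE)  # what case-wise deletion sees

cat("true mean income:    ", round(true_mean), "\n")
cat("observed mean income:", round(observed_mean), "\n")
```

Because high earners drop out more often, the mean computed from the disclosed values falls below the true mean, and collecting more data under the same mechanism would not remove that bias.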

One common method for dealing with missing values is imputation. There are many R packages for imputation. In my specialty area, political science, a widely used one is AMELIA II, by Gary King. It treats your variables as multivariate normal and iteratively improves its "guesses" of what the missing values must be until a convergence criterion is met: convergence is declared when the "guesses" seem to fit well with the rest of the data. (I'm sorry that this is nonspecific. I haven't used AMELIA II in several years. The documentation is thorough and lucidly written, so I would start there.)
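As a rough illustration of the idea of imputing before predicting, here is a sketch in base R. It uses single mean imputation with training-set means, which is a much cruder technique than the multivariate-normal approach described above and is not what AMELIA II does. The data and the helper `impute_with_means` are hypothetical.

```r
# Toy data (hypothetical). The test set has NAs in its predictors.
train <- data.frame(
  y  = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1),
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(0.5, 1.5, 1.0, 2.0, 2.5, 2.0, 3.5, 3.0)
)
test <- data.frame(
  x1 = c(2.5, NA, 7.5),
  x2 = c(1.0, 2.0, NA)
)

ols <- lm(y ~ x1 + x2, data = train)

# Without imputation, a row with an NA predictor gets an NA prediction.
raw_pred <- predict(ols, newdata = test, na.action = na.pass)

# Hypothetical helper: replace each NA in `newdata` with the mean of the
# same column in `reference` (here, the training data).
impute_with_means <- function(newdata, reference) {
  for (col in names(newdata)) {
    missing <- is.na(newdata[[col]])
    newdata[[col]][missing] <- mean(reference[[col]], na.rm = TRUE)
  }
  newdata
}

test_filled <- impute_with_means(test, train)

# Every test row now gets a fitted value and a prediction interval.
yhat <- predict(ols, newdata = test_filled, interval = "prediction")
print(yhat)
```

Note that the intervals returned here treat the filled-in values as if they had been observed, so they understate the real uncertainty. Multiple imputation exists largely to address that problem.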

But this is just one option. I'm sure that more knowledgeable people will speak up with their contributions.
