How to apply a model on a dataset with missing data

linear, missing data, regression-strategies

This question is similar to Missing input value during prediction of a generalized linear model.

Consider the following scenario:

I fitted a linear regression model on a training dataset in which some
predictor variables are occasionally missing. An imputation strategy was used
to fill in the gaps, so the model does not involve a missing-indicator variable.

Now I would like to use the fitted model to make predictions on a new dataset, which also has predictor variables that are sometimes missing. Here are my questions:

  1. How should I go about applying the model? My first reaction is: whatever imputation strategy was employed during model fitting, apply it to the new dataset and then apply the fitted model. If I go down that path, do I need to adjust for the fact that imputed data are used?

  2. What if I have no information on which imputation strategy was employed when the model was developed (e.g. the model was developed by another researcher)?

Best Answer

If you have access to the data set the model was trained on, you could impute the new data and then compare summary statistics (means, standard deviations, etc.) against the training set to see how they differ.
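As a rough illustration of that comparison, here is a minimal sketch using only the Python standard library. The data, the `impute_with` helper, and the choice of mean imputation are all assumptions made up for the example, not anything taken from the original model.

```python
from statistics import mean, stdev

def impute_with(values, fill):
    """Replace None entries with a fixed fill value (hypothetical helper)."""
    return [fill if v is None else v for v in values]

# Hypothetical predictor column from the training set (complete or already imputed).
train_x = [2.0, 3.0, 4.0, 5.0, 6.0]

# Hypothetical new data; None marks a missing value.
new_x = [2.5, None, 4.5, None, 6.5]

# Candidate strategy (an assumption): fill gaps with the mean of the observed new values.
observed = [v for v in new_x if v is not None]
new_imputed = impute_with(new_x, mean(observed))

# Compare summary statistics between the training data and the imputed new data.
summary = {
    "train": (mean(train_x), stdev(train_x)),
    "new_imputed": (mean(new_imputed), stdev(new_imputed)),
}
for name, (m, s) in summary.items():
    print(f"{name}: mean={m:.3f}, sd={s:.3f}")
```

Note that filling gaps with a single value tends to shrink the standard deviation, so a smaller spread in the imputed set is not by itself evidence that the two data sets differ.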

You could also work backwards: use the model as is on a data set and compute statistics for that set, then try out different imputation techniques on the test data set, recomputing the same statistics each time and comparing them.

If predictor variables are missing, you might be able to throw out those data points, provided you have a sufficiently large data set to work from. In that case, you could also retrain the model using imputation and cross-validation to achieve a desirable prediction score.
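Before throwing rows out, it can help to check how much data would be lost. A minimal sketch, assuming a made-up data set in which `None` marks a missing predictor value:

```python
# Hypothetical rows of predictor values; None marks a missing entry.
rows = [
    [1.2, 0.5],
    [None, 0.7],
    [3.1, None],
    [4.0, 1.1],
    [5.3, 1.4],
]

# Keep only the rows in which every predictor is observed.
complete_rows = [r for r in rows if None not in r]

# Fraction of the data lost by dropping incomplete rows.
dropped_fraction = 1 - len(complete_rows) / len(rows)
print(f"kept {len(complete_rows)} of {len(rows)} rows "
      f"({dropped_fraction:.0%} dropped)")
```

If the dropped fraction is large, or if the rows with gaps look systematically different from the complete ones, imputation is probably the safer route.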

Remember: if you do retrain the model on a new data set and there is missing data that you wish to impute, split your data set into train/test sets (or cross-validation folds) before imputation, as this mimics real life, where the new data are not available when the model is fitted. Then perform imputation on the training set, and once you have the trained model you want with the type of imputation you want, perform that same data preprocessing on your testing set.
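The order of operations described above can be sketched as follows. This is only an illustration under assumptions: the data are invented, mean imputation stands in for whatever strategy you choose, and `fit_mean_imputer` and `apply_imputer` are hypothetical helper names, not part of any library.

```python
import random
from statistics import mean

def fit_mean_imputer(rows):
    """Learn one fill value per column from the observed (non-None) entries."""
    n_cols = len(rows[0])
    return [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]

def apply_imputer(rows, fills):
    """Fill missing entries using fill values learned elsewhere."""
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]

# Hypothetical predictor matrix; None marks a missing value.
data = [
    [1.0, 10.0], [2.0, None], [None, 30.0], [4.0, 40.0],
    [5.0, None], [6.0, 60.0], [None, 70.0], [8.0, 80.0],
]

# 1. Split first, so the test rows cannot influence the imputation.
rng = random.Random(0)
shuffled = data[:]
rng.shuffle(shuffled)
split = int(0.75 * len(shuffled))
train_rows, test_rows = shuffled[:split], shuffled[split:]

# 2. Learn the fill values from the training rows only.
fills = fit_mean_imputer(train_rows)

# 3. Apply the same fill values to both sets.
train_imputed = apply_imputer(train_rows, fills)
test_imputed = apply_imputer(test_rows, fills)

# The model would then be fitted on train_imputed and evaluated on test_imputed.
```

The same idea carries over to cross-validation: the fill values are re-learned inside each fold from that fold's training portion only.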

Hope this helps point you in the right direction!
