Solved – Imputing missing values in Python using RandomForest model

data-imputation, missing data, python, random forest, scikit learn

I know some strategies for imputing missing data, for example filling with zeros or with the mean, median, or most frequent value.

What I still don't quite understand is how missing values can be predicted in Python using a machine learning technique such as RandomForestRegressor.

What steps should be taken to impute the values by predicting them with a random forest (or perhaps another model, such as kNN)?

Best Answer

What I would do:

For each variable in your data, I would regress it on the rest of the data. So for variable v1, regress it on those of v2 ... vn that do not overlap with v1 in their missing data. To find them, save the names or indexes of the subjects that have missing data for v1 to a list, then check each of the other variables for missing values among those same subjects. Use only the variables that have none. With such an if statement in place, v2 would then be regressed on v1, v3 ... vn, and so on. This way, every predictor is observed for the subjects whose values you need to predict; training rows that are still incomplete can simply be dropped before fitting.
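The overlap check described above can be sketched with pandas roughly as follows. The toy DataFrame, the column names v1 ... v4, and the helper name `usable_predictors` are all made up for illustration; this is one possible way to write the if statement, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: v1 is missing for rows 1 and 4.
df = pd.DataFrame({
    "v1": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
    "v2": [2.0, 4.1, 6.2, 7.9, 10.1, 12.0],    # complete
    "v3": [np.nan, 1.0, 0.5, 0.2, 0.9, 0.4],   # missing in row 0 only
    "v4": [5.0, np.nan, 3.0, 2.0, 1.0, 0.0],   # missing in row 1, like v1
})

def usable_predictors(frame, target):
    """Columns that have no missing values in the rows where `target` is missing."""
    target_missing = frame[target].isna()
    keep = []
    for col in frame.columns:
        if col == target:
            continue
        # The "if statement": skip columns whose gaps overlap with the target's gaps.
        if not (frame[col].isna() & target_missing).any():
            keep.append(col)
    return keep

predictors = usable_predictors(df, "v1")
print(predictors)
```

Here v4 is excluded because it is missing for a subject that also lacks v1, while v3 is kept because its only gap falls on a subject whose v1 is observed.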

After fitting the regression, you can use the predictors (v2 ... vn) to predict the missing data in v1. Because you already know which subjects have missing data for v1, you can feed those subjects' values for the other variables into the fitted model and impute v1 with its predictions.
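For a single variable, the fit-then-predict step might look like the sketch below, using scikit-learn's RandomForestRegressor as the regression. The data, the choice of `n_estimators=100`, and the predictor list are assumptions for the example; the predictors are taken as already chosen so that they are observed wherever v1 is missing.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: v1 is missing for rows 2 and 5.
df = pd.DataFrame({
    "v1": [1.0, 2.1, np.nan, 3.9, 5.2, np.nan, 7.1, 8.0],
    "v2": [1.1, 2.0, 3.0, 4.1, 5.0, 6.2, 7.0, 8.1],
    "v3": [0.5, np.nan, 1.4, 2.1, 2.4, 3.1, 3.4, 4.0],
})
target = "v1"
predictors = ["v2", "v3"]  # assumed to have no gaps where v1 is missing

missing = df[target].isna()

# Train on subjects with an observed target; drop rows with incomplete predictors.
train = df.loc[~missing, predictors + [target]].dropna()
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train[predictors], train[target])

# Predict the target for the subjects where it is missing and fill it in.
imputed = df.copy()
imputed.loc[missing, target] = model.predict(df.loc[missing, predictors])
print(imputed)
```

Working on a copy keeps the original data intact, so you can still tell which values were observed and which were predicted.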

By doing this for each variable, you will get an imputed dataset.
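Putting the steps together for every variable could look something like this. The function name `impute_with_forest`, the toy data, and the decision to leave a column untouched when nothing usable remains are my own assumptions; treat it as a sketch for numeric columns rather than a finished implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_with_forest(frame, random_state=0):
    """Impute each numeric column from the columns whose gaps do not overlap with it."""
    result = frame.copy()
    for target in frame.columns:
        missing = frame[target].isna()
        if not missing.any():
            continue  # nothing to impute for this variable
        # Keep predictors that are observed wherever the target is missing.
        predictors = [
            col for col in frame.columns
            if col != target and not (frame[col].isna() & missing).any()
        ]
        train = frame.loc[~missing, predictors + [target]].dropna()
        if not predictors or train.empty:
            continue  # assumption: leave the gaps if there is nothing to learn from
        model = RandomForestRegressor(n_estimators=50, random_state=random_state)
        model.fit(train[predictors], train[target])
        result.loc[missing, target] = model.predict(frame.loc[missing, predictors])
    return result

# Hypothetical toy data with one gap per column and no overlapping gaps.
df = pd.DataFrame({
    "v1": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "v2": [2.1, 3.9, 6.0, 8.2, np.nan, 12.1, 13.8, 16.0],
    "v3": [0.9, 2.1, 2.9, 4.2, 5.1, 5.8, np.nan, 8.1],
})
imputed = impute_with_forest(df)
print(imputed)
```

Each model is fitted on the original data only, so a value imputed for one variable is never used as a predictor for another.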

If you are not already doing so, you can use Pandas to easily index the variables and subjects that have missing data.
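As a small illustration of that indexing, the snippet below finds the variables and subjects with missing data using `isna()`. The subject labels s1 ... s4 and the column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows are subjects, columns are variables.
df = pd.DataFrame(
    {
        "v1": [1.0, np.nan, 3.0, np.nan],
        "v2": [0.5, 0.7, np.nan, 0.9],
        "v3": [10.0, 11.0, 12.0, 13.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

missing_mask = df.isna()                                   # True where a value is missing
cols_with_missing = df.columns[missing_mask.any()].tolist()
subjects_missing_v1 = df.index[df["v1"].isna()].tolist()
missing_per_column = missing_mask.sum()                    # count of gaps per variable
complete_subjects = df.index[~missing_mask.any(axis=1)].tolist()

print(cols_with_missing)
print(subjects_missing_v1)
print(missing_per_column)
print(complete_subjects)
```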
