Solved – Building a predictive model with many features and missing data

feature-selection, missing-data, predictive-models, random-forest

I've recently been teaching myself to use R to build predictive models, and I have many questions about how to attack a problem. I'm given a data set of 8000 observations with 300 features. My goal is to build a predictive model for a target column of continuous values. I'm not told where the data comes from, nor the actual meaning of the features (they are listed only as f1 to f300). Five features are categorical. Almost all features have some missing values, and only 20 observations are complete cases. I have several questions:

  1. How should I deal with the missing data? Would the MICE or Amelia packages in R for multiple imputation be a good choice? If so, how do I handle missing categorical values?

  2. Should I do data imputation before building a predictive model, or the other way around?

  3. Would a random forest be a good choice in this case? How many features should be considered in each decision tree?

Thanks in advance for any suggestions.

Best Answer

  1. There are several possibilities for handling missing data. A typical easy one is imputing the median for continuous predictors and the mode for discrete ones. More sophisticated methods are also available (e.g. imputation with a random forest; see here for some possibilities with the R package mlr: http://mlr-org.github.io/mlr-tutorial/devel/html/impute/index.html)
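The median/mode idea above can be sketched in base R. This is a minimal illustration, not a recommendation over multiple imputation; the data frame `df` and columns `f1`, `f2` are made-up stand-ins for your f1–f300:

```r
# Sketch: impute the median for numeric columns and the mode
# (most frequent non-missing level) for factor columns.
impute_simple <- function(df) {
  for (col in names(df)) {
    miss <- is.na(df[[col]])
    if (!any(miss)) next
    if (is.numeric(df[[col]])) {
      df[[col]][miss] <- median(df[[col]], na.rm = TRUE)
    } else {
      tab <- table(df[[col]])                    # counts of non-NA levels
      df[[col]][miss] <- names(tab)[which.max(tab)]
    }
  }
  df
}

# Hypothetical toy data with missing values in both column types
df <- data.frame(f1 = c(1, NA, 3), f2 = factor(c("a", "a", NA)))
df_imp <- impute_simple(df)
```

Note that single imputation like this understates uncertainty; packages such as mice or Amelia handle that via multiple imputation.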

  2. As most algorithms for predictive modeling cannot handle missing data, you should do the imputation before building a model.

  3. Random forest (randomForest or ranger in R) and a linear model (lm in R) are good first options for regression problems. Boosting methods (e.g. the xgboost package) usually give better results with some parameter tuning, but they require a bit more experience.
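A short sketch of fitting both baselines on already-imputed data. The data here is simulated for illustration; with your real data you would use the imputed 8000 x 300 frame instead. For regression, randomForest's `mtry` (the number of features tried at each split) defaults to roughly p/3, which answers question 3's "how many features" point and can itself be tuned:

```r
# Sketch: random forest and linear model as first baselines for regression.
library(randomForest)

set.seed(1)
n <- 200
df_imp <- data.frame(f1 = rnorm(n), f2 = rnorm(n))          # toy predictors
df_imp$target <- 2 * df_imp$f1 - df_imp$f2 + rnorm(n, sd = 0.1)

# mtry defaults to floor(p / 3) for regression; tune it if results are poor
rf  <- randomForest(target ~ ., data = df_imp, ntree = 500)
lin <- lm(target ~ ., data = df_imp)

print(rf)                  # out-of-bag mean squared error, % variance explained
summary(lin)$r.squared     # in-sample fit of the linear baseline
```

Comparing the forest's out-of-bag error against the linear model's cross-validated error is a quick way to see whether the extra flexibility is paying off.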