Solved – How to use Random Forest for categorical variables with missing value

Tags: categorical data, missing data, predictive-models, random forest

I have a labelled dataset of 1M rows and 600 features, and I am trying to build a supervised learning model for prediction. I am working specifically with random forests in R. The data has the following properties:

  1. Most of the features are categorical in nature.
  2. Each categorical variable has multiple levels (some of them have 20 levels).
  3. Some of the features have missing values.

Can random forests work without imputation of these missing values? If not, what is the best way to impute these missing categorical values? Any literature or R functionality that addresses this issue would be really helpful.

Best Answer

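Off the top of my head, I would say that this shouldn't be a major issue. The randomForest package in R implements random forests built from CART-style trees. One of the nicest things about trees is that they handle categorical variables "natively": factors can be passed in as they are, with no dummy coding. The package does cap the number of levels per categorical predictor (53 in recent versions), so 20 levels is fine. Missing values are a different matter. A single CART can route them using surrogate splits, but randomForest does not do this and by default stops with an error when it meets an NA (na.action = na.fail). It does ship two helpers: na.roughfix, which fills in the column median or most frequent level, and rfImpute, which imputes using the forest's proximities. Here is the package documentation; you can download the package itself from CRAN.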
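As a rough sketch of what this looks like in practice, assuming the randomForest package from CRAN is the one meant here: the toy data frame and its column names below are made up for illustration. The example fits a forest on factor predictors with some values missing, first with the default na.action (which is expected to fail) and then with na.roughfix.

```r
# Assumes the randomForest package from CRAN is installed.
library(randomForest)

set.seed(42)

# Toy data (made up for illustration): two factor predictors and a factor label.
n <- 200
toy <- data.frame(
  colour = factor(sample(c("red", "green", "blue"), n, replace = TRUE)),
  size   = factor(sample(c("S", "M", "L"), n, replace = TRUE))
)
toy$label <- factor(ifelse(toy$colour == "red" | toy$size == "L", "yes", "no"))

# Knock out 10% of one predictor to mimic missing data.
toy$colour[sample(n, 20)] <- NA

# The default na.action is na.fail, so this call stops with an error.
fit_default <- try(randomForest(label ~ ., data = toy), silent = TRUE)
failed_on_na <- inherits(fit_default, "try-error")

# na.roughfix fills factor NAs with the most frequent level
# (and numeric NAs with the column median) before fitting.
fit <- randomForest(label ~ ., data = toy,
                    na.action = na.roughfix, ntree = 100)
print(fit)
```

Note that na.roughfix is a quick fix rather than a careful imputation; whether it is good enough depends on how much data is missing and why.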

Chapter 8 of James, Witten, Hastie, & Tibshirani's An Introduction to Statistical Learning: with Applications in R offers a good introduction to tree methods and also covers random forests on page 328.

Imputing missing values is a whole topic in and of itself and, depending on your needs and data, you might be able to get away without doing it. If you do have to perform imputation, you might want to check here and here for some quick pointers, but you're probably just going to have to read up on imputation methods and make a judgement call on what to go with.
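To make that judgement call a little more concrete, here is a minimal base-R sketch of two simple options for categorical columns: filling NAs with the most frequent level, or treating "missing" as a level of its own. The data frame, the helper names impute_mode and add_missing_level, and the "Missing" label are all hypothetical, chosen only for illustration.

```r
# Base R only; the column names and values are made up for illustration.
df <- data.frame(
  colour = factor(c("red", NA, "blue", "red", NA, "green")),
  size   = factor(c("S", "M", NA, "M", "M", "L"))
)

# Option 1: replace NAs with the most frequent observed level (mode imputation).
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))  # table() drops NA by default
  x[is.na(x)] <- mode_level
  x
}
df_mode <- as.data.frame(lapply(df, impute_mode))

# Option 2: keep the information that a value was missing
# by turning NA into an explicit factor level.
add_missing_level <- function(x, label = "Missing") {
  x <- addNA(x)                          # add NA as a factor level
  levels(x)[is.na(levels(x))] <- label   # give that level a readable name
  x
}
df_flag <- as.data.frame(lapply(df, add_missing_level))

str(df_mode)
str(df_flag)
```

Mode imputation is cheap but shifts the level frequencies toward the majority level. An explicit missing level lets the forest split on missingness itself, which can help when values are not missing at random. For something less crude, rfImpute in the randomForest package imputes using proximities from a fitted forest.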
