Random Forest – Best Practices for Data Wrangling Before Running Random Forest Predictions

data-preprocessing, predictive-models, random-forest

When doing predictions with Random Forests, we very often (or always) need to perform some pre-processing. Since my background is in computing and pretty much everything I know about statistics comes from self-learning, this process is driven more by intuition than by theory.

For instance, some of the things I get stuck on are:

  1. Outliers. Should we remove them all? If so, do we identify an outlier using the 3/2 rule? Or should we keep them? Why?
  2. When dealing with deltas of observations (for example, suppose I'm subtracting one student's grade from another's), should I normalize the deltas across all students or just stick to the absolute delta?
  3. Sticking with the same student example: if I have cumulative data (suppose for every test I sum the student's previous grades), should the process be the same?
  4. Do we need to apply any data transformation, such as a log transform? If so, when should it be done? When the data range is large? What's the point of changing the domain of the data here?
  5. If I have a categorical target, can I apply regression instead of classification so the output is continuous (supposing the classes are 0, 1, 2, something like 0.132 or 0.431)? Would that be more accurate?
  6. For what kinds of problems is Random Forest most suitable? Large datasets?
  7. Should I discard the less important variables? Maybe they just create noise?

I know that pre-processing depends on the problem, the data, and so on, and I know there are many more things to consider when pre-processing. Here I'm trying to understand the concepts behind pre-processing data and the key points to look for when doing so. With that in mind, what are the key points to look for when pre-processing data? (If I missed other important points, and I'm sure a lot is missing, please cover those too.) Imagine you're teaching this to your grandpa 🙂

Best Answer

When doing predictions with Random Forests, we very often (or always) need to perform some pre-processing.

This is not true. Random Forest is really "off-the-shelf".

Outliers. Should we remove them all? If so, do we identify an outlier using the 3/2 rule? Or should we keep them? Why?

The base model used in RF is a large decision tree (usually built via CART). Decision trees are robust to outliers because they isolate them in small regions of the feature space. And since the prediction for each leaf is the leaf average (for regression) or the majority class (for classification), outliers that end up isolated in separate leaves won't influence the rest of the predictions (in regression, for instance, they would not affect the means of the other leaves). Bottom line: you don't need to care about outliers in RF. Just remove them if they are aberrant observations (e.g., due to recording errors); if they're valid cases, you can keep them.
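
A minimal sketch of this point, using scikit-learn and synthetic data (neither is part of the original answer): a forest fitted with one aberrant target value gives essentially the same predictions as a forest fitted without it, as long as you query away from the outlier itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X.ravel() + rng.normal(0, 0.5, size=200)

# Add a single aberrant observation with a wildly wrong target value.
X_out = np.vstack([X, [[5.0]]])
y_out = np.append(y, 1000.0)

clean = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
dirty = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_out, y_out)

# Query points away from the outlier at x = 5: the outlier sits in its own
# leaves, so the two sets of predictions are nearly identical.
grid = np.array([[1.0], [3.0], [7.0], [9.0]])
print(clean.predict(grid))
print(dirty.predict(grid))
```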

When dealing with deltas of observations (for example, suppose I'm subtracting one student's grade from another's), should I normalize the deltas across all students or just stick to the absolute delta? Sticking with the same student example: if I have cumulative data (suppose for every test I sum the student's previous grades), should the process be the same?

The question here is not really related to RF; it is algorithm-independent. The real question is: what do you want to do? What are you trying to predict?

Do we need to apply any data transformation, such as a log transform? If so, when should it be done? When the data range is large? What's the point of changing the domain of the data here?

For the same reasons you don't need to worry about outliers, you don't need to apply any kind of data transformation when using RF. For classification, you may need to apply some kind of resampling/weighting strategy if you have a class-imbalance problem, but that's it.
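
A minimal sketch of that weighting strategy (scikit-learn and a synthetic imbalanced dataset assumed; not from the original answer): the raw, untransformed features go straight into the forest, and only a class-weighting option compensates for the imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Imbalanced binary problem: roughly 95% of samples in class 0.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" up-weights the minority class when growing the trees;
# no scaling or log transform of the features is needed.
rf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
rf.fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te)))
```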

If I have a categorical target, can I apply regression instead of classification so the output is continuous (supposing the classes are 0, 1, 2, something like 0.132 or 0.431)? Would that be more accurate?

You cannot apply regression if your target is categorical: class labels such as 0, 1, 2 are arbitrary codes with no meaningful order or distance between them, so averaging them (which is what regression would do in each leaf) produces numbers with no interpretation.

For what kinds of problems is Random Forest most suitable? Large datasets?

RF is indicated for all types of problems. People (especially in the medical field, genomics, etc.) even use it primarily for its variable importance measures. In genetics, where researchers face the "small $n$, large $p$" problem, RF also does very well. That said, machine learning in general requires sufficient amounts of training and testing data, though there's no general rule. If your training data represents all your concepts and if these concepts are easy to capture, a couple of hundred observations may suffice. However, if what should be learned is very complex and some concepts are underrepresented, more training data will be needed.
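
A minimal sketch of the "use RF mainly for its importance measures" workflow on a small-$n$, large-$p$ style dataset (scikit-learn and synthetic data assumed; not part of the original answer):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 100 observations, 500 predictors, only 5 of them informative.
# With shuffle=False, the informative predictors are columns 0-4.
X, y = make_classification(n_samples=100, n_features=500, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Rank predictors by impurity-based importance; the informative columns
# should typically dominate the top of the ranking.
top = np.argsort(rf.feature_importances_)[::-1][:10]
print(top)
```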

Should I discard the less important variables? Maybe they just create noise?

Another nice feature of decision trees built with CART is that they automatically set aside the unimportant variables (only the best splitters are selected at each split). In their seminal book, Hastie et al. (2009) showed that with 100 pure-noise predictors and 6 relevant predictors, the relevant variables were still selected about 50% of the time at each split. So you really don't need to worry about variable selection in RF. Of course, if you know that some variables are not contributing, don't include them; but if the underlying mechanisms of the process you're studying are mostly unknown, you can include all your candidate predictors.
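
In the same spirit, here is a minimal sketch (scikit-learn and a synthetic target assumed; not from the original answer) comparing cross-validated accuracy with and without 100 pure-noise predictors alongside 6 relevant ones. The drop from including the noise columns is typically modest, which is the practical sense in which variable selection is not a prerequisite for RF.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
X_relevant = rng.normal(size=(n, 6))          # 6 relevant predictors
X_noise = rng.normal(size=(n, 100))           # 100 pure-noise predictors
X_all = np.hstack([X_relevant, X_noise])
y = (X_relevant.sum(axis=1) > 0).astype(int)  # target depends only on the 6

rf = RandomForestClassifier(n_estimators=500, random_state=0)
print(cross_val_score(rf, X_relevant, y, cv=5).mean())  # relevant predictors only
print(cross_val_score(rf, X_all, y, cv=5).mean())       # all 106 predictors
```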
