Solved – Correlation and Variable importance in Random Forest

machine-learning, random-forest

I have a dataset with 5000 observations and 100 variables, many of which are correlated. It is a classification problem, so I am thinking of using Random Forest for prediction and variable selection. My doubt is: do I need to take care of multicollinearity before putting the data into the RF model, or does RF automatically take care of the multicollinearity problem?

Best Answer

In theory, multicollinearity is not a problem for RF. Each node of each tree is constructed by finding a single predictor and a cutpoint for it, so only one candidate predictor is examined at a time; relationships between predictors therefore cannot interfere with a split. RF simply never looks at more than one predictor at once. Moreover, at each node only a random subset of the predictors is considered, which is another anti-collinearity feature of RF.
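As a minimal sketch of this point, the synthetic example below (assuming scikit-learn's `RandomForestClassifier`; the data and column names are made up for illustration) adds a near-duplicate of an informative predictor and shows that RF still predicts well. The `max_features` parameter is the per-node predictor subsample mentioned above:

```python
# Sketch: RF trains fine even when two predictors are nearly collinear.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
x_copy = x + rng.normal(scale=0.01, size=n)   # nearly identical predictor
noise = rng.normal(size=(n, 3))               # irrelevant predictors
X = np.column_stack([x, x_copy, noise])
y = (x > 0).astype(int)                       # label depends only on x

# max_features controls the random per-node predictor subset
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                             random_state=0)
clf.fit(X[:1500], y[:1500])
print(clf.score(X[1500:], y[1500:]))          # high held-out accuracy
```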

In practice, however, there are some pitfalls. Imagine two very closely related predictors, say X and Y. If they are good predictors, RF will pick between them more or less at random at each split, so it will end up using X and Y roughly equally often. Why is this a problem?

If you are interested in variable importance, the importance that belongs to one underlying signal is now split between X and Y, so you may conclude that each is merely "quite important". Had you used only X, it would take over most of Y's importance, and you would conclude that X is "very important"; and since Y is closely related to X, it is very important too. The latter picture seems more reasonable.
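This dilution is easy to demonstrate. In the hedged sketch below (synthetic data, scikit-learn's impurity-based `feature_importances_`; nothing here comes from the original post), the same signal is fit once with both X and Y present and once with X alone:

```python
# Sketch: importance is split between two correlated predictors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 3000
x = rng.normal(size=n)
y_feat = x + rng.normal(scale=0.05, size=n)   # Y: a near-copy of X
noise = rng.normal(size=(n, 2))               # irrelevant predictors
target = (x > 0).astype(int)

both = RandomForestClassifier(n_estimators=200, random_state=0).fit(
    np.column_stack([x, y_feat, noise]), target)
alone = RandomForestClassifier(n_estimators=200, random_state=0).fit(
    np.column_stack([x, noise]), target)

print(both.feature_importances_[:2])   # importance shared by X and Y
print(alone.feature_importances_[0])   # X alone absorbs nearly all of it
```

Neither X nor Y looks as important in the first fit as X does in the second, even though the underlying signal is identical.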

If you are interested in using your RF to predict future observations, you will be forced to supply both X and Y for each new observation, which can be difficult, expensive, time-consuming, and so on.
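One pragmatic remedy, if you want to avoid collecting redundant predictors, is to drop one member of each highly correlated pair before training. The helper below is a hypothetical sketch (the function name and the 0.95 threshold are my own choices, not from the answer), using a greedy pass over the correlation matrix:

```python
# Sketch: keep only one predictor from each highly correlated group,
# so future observations need fewer measured columns.
import numpy as np

def drop_correlated(X, threshold=0.95):
    """Return column indices to keep, greedily dropping any column whose
    absolute correlation with an already-kept column exceeds threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep
```

The retained columns can then be fed to the RF; the threshold is a judgment call and a domain-driven choice of which twin to keep is usually better than a greedy one.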

To sum up: multicollinearity is not a problem for the RF algorithm, but it may be a problem for the RF user.