Solved – variable reduction before doing random forest in R

Tags: r, random-forest, variance-inflation-factor

I have a dataset with around 50 predictors, some of which are correlated. I am trying to fit a random forest model in R for prediction purposes with this dataset.

Because there are so many predictors, I want to remove some of them. The only way I can think of is a VIF analysis.

Is it correct to use VIF to remove variables before fitting a random forest? Are there other ways to reduce the number of variables for random forests? Is it even necessary to remove variables for a random forest model?

Best Answer

There are two reasons why you might want to reduce the number of features:

  • Predictive Power: Random forest accuracy is not much affected by multicollinearity. Each tree is grown on a random sample of the training data and considers only a random subset of the features at each split, so whichever feature yields the larger decrease in impurity gets picked. Because of this, neither a large number of predictors nor correlated predictors should hurt model accuracy (a toy simulation after this list illustrates the point).

  • Interpretability: If you want to interpret the model through its features and their impact, multicollinearity can hurt you. When two correlated predictors are both important, a tree will choose one of them, and with a small number of trees you may lose the other entirely. For that reason you might want to reduce the features.
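As a quick illustration of the first point, here is a minimal sketch on synthetic data (not from the question): adding a near-duplicate of a predictor barely moves the out-of-bag error.

```
library(randomForest)
set.seed(42)

# Toy classification data: two informative predictors
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- factor(ifelse(x1 + x2 + rnorm(n, sd = 0.5) > 0, "yes", "no"))

# Baseline forest
d1   <- data.frame(y, x1, x2)
fit1 <- randomForest(y ~ ., data = d1, ntree = 500)

# Same data plus a near-duplicate of x1 (correlation ~ 1)
d2   <- data.frame(d1, x1_copy = x1 + rnorm(n, sd = 0.05))
fit2 <- randomForest(y ~ ., data = d2, ntree = 500)

# The final OOB error rates should be very close
tail(fit1$err.rate[, "OOB"], 1)
tail(fit2$err.rate[, "OOB"], 1)
```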

Methods: I would suggest using the built-in importance() function in randomForest. It computes the importance of each feature based on the Gini importance, also called Mean Decrease in Impurity (MDI).

```
library(randomForest)

# Fit the forest; importance = TRUE stores the importance measures
fit <- randomForest(Target ~ ., data = training_data, importance = TRUE, ntree = 500)

# type = 2 extracts the Gini importance (MeanDecreaseGini)
var.imp <- data.frame(importance(fit, type = 2))
var.imp$Variables <- row.names(var.imp)

# Order the features from most to least important
var.imp <- var.imp[order(var.imp$MeanDecreaseGini, decreasing = TRUE), ]

# Bar plot of normalized importances; wide bottom margin for rotated labels
par(mar = c(10, 5, 1, 1))
barplot(t(var.imp["MeanDecreaseGini"] / sum(var.imp$MeanDecreaseGini)),
        las = 2, cex.names = 1,
        main = "Gini Impurity Index Plot")
```

This will give something like the plot below, and you can exclude the features with lower importance.

[Bar plot of normalized Gini importance for each feature]
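As a sketch of the exclusion step, using the `var.imp` data frame built above (the cutoff of 15 is arbitrary and should be tuned against validation error):

```
# Keep the top 15 features by Gini importance and refit the forest
top_vars    <- head(var.imp$Variables, 15)
fit_reduced <- randomForest(reformulate(top_vars, response = "Target"),
                            data = training_data, ntree = 500)
```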

You can also check other methods like

  • Permutation Importance or Mean Decrease in Accuracy (MDA)

  • Information Gain / Entropy

  • Gain Ratio

All of these are most useful when the dependent variable is categorical. If your dependent variable is continuous, you can follow the classical approach of computing the correlation between each feature and the target.
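For completeness, here is a minimal sketch of two of those options, assuming the `fit` object and `training_data` from above; for the correlation screen, the response column `Target` is assumed to be numeric.

```
# Permutation importance (Mean Decrease in Accuracy): type = 1.
# Requires importance = TRUE at fit time, as in the call above.
perm.imp <- data.frame(importance(fit, type = 1))
perm.imp$Variables <- row.names(perm.imp)
perm.imp[order(perm.imp[, 1], decreasing = TRUE), ]

# Classical screen for a continuous target: correlation of each
# numeric feature with the response
num_vars <- setdiff(names(Filter(is.numeric, training_data)), "Target")
cors <- sapply(num_vars, function(v) cor(training_data[[v]], training_data$Target))
sort(abs(cors), decreasing = TRUE)
```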