Solved – variable reduction before doing random forest in R

Tags: r, random-forest, variance-inflation-factor

I have a dataset with around 50 predictors, some of which are correlated. I am trying to fit a random forest model in R for prediction purposes with this dataset.

Because there are so many predictors, I want to remove some of them. The only way I can think of is a VIF analysis.

Is it correct to use VIF to remove variables before fitting a random forest? Are there other ways to reduce the number of variables for random forests? Is it even necessary to remove variables for a random forest model?

Best Answer

There are two reasons why you might want to reduce the number of features:

  • Predictive Power: Random forest accuracy is not much affected by multicollinearity. Each tree is grown on a random sample of the training data and considers only a random subset of the features at each split, so whichever feature yields the larger decrease in impurity gets picked. Because of this, neither a large number of predictors nor correlated predictors should hurt model accuracy (a toy simulation after this list illustrates the point).

  • Interpretability: If you want to interpret the model through its features and their impact, multicollinearity can hurt you. When two correlated predictors are both important, a tree will choose one of them, and with a small number of trees you may lose the other entirely. For that reason you might want to reduce the features.
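As a quick illustration of the first point, here is a minimal sketch on synthetic data (not from the question): adding a near-duplicate of a predictor barely moves the out-of-bag error.

```
library(randomForest)
set.seed(42)

# Toy classification data: two informative predictors
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- factor(ifelse(x1 + x2 + rnorm(n, sd = 0.5) > 0, "yes", "no"))

# Baseline forest
d1   <- data.frame(y, x1, x2)
fit1 <- randomForest(y ~ ., data = d1, ntree = 500)

# Same data plus a near-duplicate of x1 (correlation ~ 1)
d2   <- data.frame(d1, x1_copy = x1 + rnorm(n, sd = 0.05))
fit2 <- randomForest(y ~ ., data = d2, ntree = 500)

# The final OOB error rates should be very close
tail(fit1$err.rate[, "OOB"], 1)
tail(fit2$err.rate[, "OOB"], 1)
```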

Methods: I would suggest using the built-in importance() function in randomForest. It computes the importance of each feature based on the Gini importance, also called Mean Decrease in Impurity (MDI).

```
library(randomForest)

# Fit the forest; importance = TRUE stores the importance measures
fit <- randomForest(Target ~ ., data = training_data, importance = TRUE, ntree = 500)

# type = 2 extracts the Gini importance (MeanDecreaseGini)
var.imp <- data.frame(importance(fit, type = 2))
var.imp$Variables <- row.names(var.imp)

# Order the features from most to least important
var.imp <- var.imp[order(var.imp$MeanDecreaseGini, decreasing = TRUE), ]

# Bar plot of normalized importances; wide bottom margin for rotated labels
par(mar = c(10, 5, 1, 1))
barplot(t(var.imp["MeanDecreaseGini"] / sum(var.imp$MeanDecreaseGini)),
        las = 2, cex.names = 1,
        main = "Gini Impurity Index Plot")
```

This will give something like the plot below, and you can exclude the features with lower importance.

[Bar plot of normalized Gini importance for each feature]
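As a sketch of the exclusion step, using the `var.imp` data frame built above (the cutoff of 15 is arbitrary and should be tuned against validation error):

```
# Keep the top 15 features by Gini importance and refit the forest
top_vars    <- head(var.imp$Variables, 15)
fit_reduced <- randomForest(reformulate(top_vars, response = "Target"),
                            data = training_data, ntree = 500)
```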

You can also check other methods like

  • Permutation Importance or Mean Decrease in Accuracy (MDA)

  • Information Gain / Entropy

  • Gain Ratio

All of these are most useful when the dependent variable is categorical. If your dependent variable is continuous, you can follow the classical approach of computing the correlation between each feature and the target.
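For completeness, here is a minimal sketch of two of those options, assuming the `fit` object and `training_data` from above; for the correlation screen, the response column `Target` is assumed to be numeric.

```
# Permutation importance (Mean Decrease in Accuracy): type = 1.
# Requires importance = TRUE at fit time, as in the call above.
perm.imp <- data.frame(importance(fit, type = 1))
perm.imp$Variables <- row.names(perm.imp)
perm.imp[order(perm.imp[, 1], decreasing = TRUE), ]

# Classical screen for a continuous target: correlation of each
# numeric feature with the response
num_vars <- setdiff(names(Filter(is.numeric, training_data)), "Target")
cors <- sapply(num_vars, function(v) cor(training_data[[v]], training_data$Target))
sort(abs(cors), decreasing = TRUE)
```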