Solved – Random forest and ridge regression

machine learning, multicollinearity, random forest, regression

Can we apply the concept of ridge regression within a random forest in order to get more accurate predictions?

Random forest uses regression trees for prediction. When there is a problem of multicollinearity, we use ridge regression. Multicollinearity can definitely affect variable importances in random forest models. Can we use the concept of ridge regression to overcome multicollinearity in a random forest?

Best Answer

For predictive accuracy, I would not expect multicollinearity to be a problem for random forests. For variable importances, it is much more likely to be a problem.
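To see why importances are the vulnerable part, consider what happens when a predictor is duplicated. The following sketch (not from the original answer; it assumes the randomForest package) adds a near-copy of Temp to the airquality data, which splits Temp's importance across the two copies while leaving out-of-bag accuracy essentially unchanged:

```r
library("randomForest")

## Complete cases of the airquality data
airq <- airquality[complete.cases(airquality), ]

set.seed(42)
rf1 <- randomForest(Ozone ~ ., data = airq)

## Add a near-duplicate of Temp (correlation with Temp is ~1) and refit
airq2 <- airq
set.seed(42)
airq2$Temp2 <- airq2$Temp + rnorm(nrow(airq2), sd = 0.01)
set.seed(42)
rf2 <- randomForest(Ozone ~ ., data = airq2)

importance(rf1)  ## Temp carries a large share of the importance
importance(rf2)  ## that share is now divided between Temp and Temp2

## Out-of-bag mean squared errors remain comparable:
c(rf1 = tail(rf1$mse, 1), rf2 = tail(rf2$mse, 1))
```

The importance splitting illustrates why rankings of correlated predictors should be interpreted with care, even when predictions are unaffected.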

Combining random forests and penalized (e.g., ridge) regression can be done with the R package pre. This package fits prediction rule ensembles: it first fits a tree ensemble (bagged, boosted, and/or random forest) and then selects the best rules through penalized regression (lasso, ridge, or elastic net). In the following example, we fit a random forest and a prediction rule ensemble on the airquality data. In this dataset, there is a substantial negative correlation between Temp and Wind, and a substantial positive correlation between Temp and Month:

airq <- airquality[complete.cases(airquality),]
round(cor(airq), digits = 2L)  
##         Ozone Solar.R  Wind  Temp Month   Day
## Ozone    1.00    0.35 -0.61  0.70  0.14 -0.01
## Solar.R  0.35    1.00 -0.13  0.29 -0.07 -0.06
## Wind    -0.61   -0.13  1.00 -0.50 -0.19  0.05
## Temp     0.70    0.29 -0.50  1.00  0.40 -0.10
## Month    0.14   -0.07 -0.19  0.40  1.00 -0.01
## Day     -0.01   -0.06  0.05 -0.10 -0.01  1.00

Now we fit a random forest and a prediction rule ensemble (taking a random forest + ridge regression approach through specification of the mtry and alpha arguments, respectively):

library("randomForest")
set.seed(42)
rf <- randomForest(Ozone ~ ., data = airq)
library("pre")
set.seed(42)
## mtry = p/3 mimics the random forest default for regression;
## alpha = 0 selects the ridge penalty (alpha = 1 would give the lasso)
re <- pre(Ozone ~ ., data = airq, mtry = ncol(airq)/3, alpha = 0)

Now we request and plot the variable importances:

rf_imp <- randomForest::importance(rf)
par(mfrow = c(1, 2))  ## two plots side by side
barplot(t(rf_imp), main = "random forest")
pre::importance(re, main = "prediction rule ensemble")

[Figure: variable importances for the random forest (left) and the prediction rule ensemble (right)]

We see that the variables Temp, Wind and Solar.R have very similar relative importances in the RF and PRE. The relative importances of Day and Month are lower in the PRE than in the RF.
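If predictive accuracy rather than importances is the main interest, the two fits can also be compared directly. A minimal sketch (not part of the original answer; it refits both models so it is self-contained, and the in-sample comparison shown here is optimistic, so cross-validation would be preferable in practice):

```r
library("randomForest")
library("pre")

airq <- airquality[complete.cases(airquality), ]

## Refit the two models from the answer above
set.seed(42)
rf <- randomForest(Ozone ~ ., data = airq)
set.seed(42)
re <- pre(Ozone ~ ., data = airq, mtry = ncol(airq)/3, alpha = 0)

## In-sample predictions (optimistic; cross-validate for an honest comparison)
pred_rf <- predict(rf, newdata = airq)
pred_re <- predict(re, newdata = airq)

## Squared correlation with the observed response for each model
cor(airq$Ozone, pred_rf)^2
cor(airq$Ozone, pred_re)^2

## Agreement between the two sets of predictions
cor(pred_rf, pred_re)
```

With strongly correlated predictors, similar predictive accuracy from the two approaches is what the first paragraph of the answer would lead us to expect.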