Solved – GBM and highly correlated predictors

Tags: boosting, correlation, lasso, r

I have a data set with 70 predictors (68 numeric, 2 factors). When I build a gbm model using all predictors, I get an R² of 0.767 and an RMSE of 175.15, with similar numbers on the training and test sets. When I build a model after somewhat arbitrarily removing some predictors I know are highly correlated, I get a higher R² (0.78) and a lower RMSE (165.07), again with similar results on the train and test sets.

Is there a method similar to the lasso (like regsubsets() in R) that would try to find an optimal subset of predictors for a gbm model?

Also, I thought that gbm had built-in feature selection and would not be affected by the correlated predictors [1]?

Best Answer

Also, I thought that gbm had built-in feature selection and would not be affected by the correlated predictors?

While this is true in some sense, do not take it to mean that gbm is a miracle worker. If the noise-to-signal ratio in your data is high, and you give gbm enough chances to mistake noise for signal, it will do so.

Here's an example. I'll generate $50$ predictors independent of a random response, and then let gbm look for a relationship:

set.seed(154)
library(gbm)

# 50 pure-noise predictors, 100 observations, and a response that is
# independent of all of them
df <- data.frame(matrix(rnorm(50 * 100), nrow = 100))
df$y <- rnorm(100)

M <- gbm(y ~ ., data = df,
         n.trees = 1000,
         interaction.depth = 2,
         cv.folds = 10,
         n.cores = 4)

Even though there is no signal to find here, gbm is still able to decrease the cross-validation error below that of the null model:

gbm.perf(M)

[Plot: gbm.perf(M) output — training and cross-validation deviance versus number of trees.]

gbm is telling me the optimal number of trees here is about 600, although the true value is zero. Notice how cross validation (or out-of-sample testing, in your case) is not protecting me here: sometimes the same patterns appear by chance in both the training and the hold-out data. Notice also that this happens even though the predictors are truly independent of the response.
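One way to see gbm mistaking noise for signal, beyond the cross-validation curve, is to inspect its relative influence table (a sketch I'm adding here, not part of the original answer):

```r
# Relative influence of each predictor in the fitted model M.
# With pure noise, gbm still attributes non-trivial importance to
# many of the 50 noise predictors rather than to none of them.
infl <- summary(M, plotit = FALSE)
head(infl)         # columns: var, rel.inf
sum(infl$rel.inf)  # relative influences sum to 100
```

If gbm's built-in selection were fully protecting you, nearly all of that influence would be concentrated on nothing; instead it is spread across spurious splits.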

Of course, one way to alleviate this is to fit the model on more data, and this does indeed rectify the problem. Here I regenerate the same setup with $1000$ observations and refit the same model:

# same 50 noise predictors, but now with 1000 observations
df <- data.frame(matrix(rnorm(50 * 1000), nrow = 1000))
df$y <- rnorm(1000)

[Plot: gbm.perf output after refitting on the larger data set — the cross-validation error no longer improves on the null model.]

There is never a free lunch. If you pass the gbm algorithm predictors that bear a weak or no relationship to the response, you always risk it finding spurious patterns.

Is there a method similar to lasso (like regsubsets() in R) that would try to find an optimal subset of predictors for a gbm model?

Attempts to find an optimal subset of predictors are likely to make this problem worse rather than better. Pre-screening predictors using the full data can introduce severe optimism into your model and quickly hurt its out-of-sample performance. I always recommend reading the section of The Elements of Statistical Learning on this point, titled "The Wrong and Right Way to Do Cross-validation".
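To make the danger concrete, here is a sketch (my illustration, not from the original answer) of the "wrong way": screen predictors for correlation with the response on the *full* data, then cross-validate a model restricted to the survivors. Everything below is pure noise, so an honest error estimate should show no predictive skill.

```r
# Wrong-way cross validation: screening sees the held-out labels.
set.seed(154)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)  # response is independent of every column of X

# WRONG: pick the 10 predictors most correlated with y using ALL the data
screened <- order(abs(cor(X, y)), decreasing = TRUE)[1:10]

# Then run 10-fold CV on a linear model restricted to those predictors
folds <- sample(rep(1:10, length.out = n))
cv_err <- sapply(1:10, function(k) {
  train <- folds != k
  fit   <- lm(y[train] ~ X[train, screened])
  pred  <- cbind(1, X[!train, screened]) %*% coef(fit)
  mean((y[!train] - pred)^2)
})
mean(cv_err)  # optimistic: the screening step already used the held-out labels
# RIGHT: redo the screening inside each fold, on the training rows only
```

The estimate looks far better than it should for pure noise, because the held-out observations already influenced which predictors survived screening. The same leakage happens if you let any subset-selection procedure see all your data before you evaluate the resulting gbm model.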