Solved – Feature selection in GBM

Tags: caret, feature selection, machine learning, r

I am using gradient boosting (the caret package in R). As far as I understand, feature selection is already built into this model. However, I don't fully understand how it works.

I ran 2 experiments: in the first, I used 1000 examples for training and 300 examples for cross-validation. I trained the model on 10 features, and the error on the cross-validation set was 5%. In the second experiment I added 3 new features (13 in total), trained the model on all 13, and got an error of 7%. So the error increased after adding features.

Why does this happen, if in theory GBM should select the most influential features? I expected at most a 5% error in the second experiment, so I don't understand why the error increased. Also, how can I avoid this negative effect? Which methods can I use? (Some links to R tutorials would be highly appreciated.)

Best Answer

A few things:

  • Gradient boosting is wrapped by caret; the gbm package implements the model itself.
  • Your error estimates from CV are probably not good, since you are doing feature selection outside of resampling. Google '"feature selection" "selection bias"' to find the literature on this subject.
  • The difference between those (already biased) error estimates, 5% versus 7%, may well be within the experimental noise of the data.
  • Tree ensembles are not perfect. I haven't run the experiment with gbm, but with random forests there can be a slight increase in the error rate as you add non-informative predictors. See Fig. 19.1 in Applied Predictive Modeling, which shows this effect for a variety of models.
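To see the selection-bias point concretely, here is a minimal sketch in base R (glm stands in for gbm so it runs without extra packages; the data are simulated, so all names are made up). Since y is pure noise, any honest error estimate should sit near 50%; screening features on the full data before cross-validating makes the estimate look misleadingly good.

```r
## Simulated data: 50 pure-noise predictors, a coin-flip outcome.
set.seed(1)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)

folds <- sample(rep(1:5, length.out = n))

# Pick the k predictors most correlated with y, using only the given rows.
pick <- function(rows, k = 5) {
  cors <- abs(apply(x[rows, , drop = FALSE], 2,
                    function(col) cor(col, y[rows])))
  order(cors, decreasing = TRUE)[1:k]
}

cv_error <- function(select_inside) {
  if (!select_inside) keep <- pick(1:n)     # WRONG: screen on all data first
  errs <- sapply(1:5, function(f) {
    train <- which(folds != f); test <- which(folds == f)
    if (select_inside) keep <- pick(train)  # RIGHT: screen within each fold
    d <- data.frame(y = y, x[, keep])
    fit <- glm(y ~ ., data = d[train, ], family = binomial)
    pred <- predict(fit, d[test, ], type = "response") > 0.5
    mean(pred != y[test])
  })
  mean(errs)
}

outside <- cv_error(select_inside = FALSE)  # optimistically low
inside  <- cv_error(select_inside = TRUE)   # near the true 50% error
```

The same principle applies with gbm: any filtering or screening step has to happen inside the resampling loop (caret's rfe and sbf functions are built for exactly this), otherwise the CV error is estimated on data that already "saw" the test folds during selection.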