Solved – XGBoost feature subsampling

Tags: bagging, boosting, classification, hyperparameter, overfitting

I have a dataset with ~30k samples and 35 features (after feature selection; these seem to be the most important features for this dataset, and they have low correlation with each other).

After running a grid search with 10-fold CV over the hyperparameters, I was surprised to find that I get the lowest validation error when colsample_bytree is set so low that only 1 feature is sampled for each tree! (Edit: actually, with 2 features sampled per tree it works slightly better, but performance keeps getting worse as I increase the number of features sampled per tree.) The depth of each tree is 3 and I am building 2000 trees. That is, for each tree one feature is randomly selected, and XGBoost then fits the residuals using only that feature.
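For concreteness, here is a minimal sketch of the kind of search I mean (synthetic stand-in data of roughly the same shape; the learning rate, scoring metric and exact grid values are placeholders, not my actual setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# stand-in data with roughly the same shape as my dataset (~30k samples, 35 features)
X, y = make_classification(n_samples=30000, n_features=35, random_state=0)
n_features = X.shape[1]

# colsample_bytree = k / n_features makes XGBoost draw roughly k features per tree
param_grid = {"colsample_bytree": [k / n_features for k in (1, 2, 4, 8, 16, 35)]}

model = XGBClassifier(
    n_estimators=2000,   # 2000 trees, as described above
    max_depth=3,         # depth 3, as described above
    learning_rate=0.05,  # placeholder value
)

search = GridSearchCV(model, param_grid, cv=10, scoring="neg_log_loss")
search.fit(X, y)
print(search.best_params_)
```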

That seems very unusual. How should I interpret it? Does it mean that as soon as features interact within a tree, I start to overfit? But then I would expect trees of depth 1 with no feature subsampling to perform just as well, yet they don't. In fact, in the grid search, nearly all models with such extreme feature subsampling did better than the models without feature subsampling.

Edit: is it possible that I have some features that fit the training set well but generalize very poorly, and that sampling features individually keeps those features from dominating the model? I am struggling to see what else this could mean.

Edit 2: I tried removing individual features one at a time; performance does not improve, which suggests that my hypothesis from the previous edit is unlikely. On the other hand, I found that the optimal performance actually comes from sampling 2 features per tree. At least my features are now interacting, but I am still not sure how to explain this gain in performance.

Edit 3: This paper is somewhat related, though not directly applicable: "Influence of Hyperparameters on Random Forest Accuracy" by S. Bernard et al., 2009. In random forests, the optimal setting for feature subsampling is usually around the square root of the number of features. It is easier to see why subsampling helps there, however, since random forests rely on variability between trees and do not fit residuals the way XGBoost does.

Best Answer

You seem to be fine-tuning the wrong things.

On your feature selection: I don't think this was done properly:

  • You remove the good feature and all linearly correlated features. That is fine, but features with higher-order correlations are still there. On the other hand, strong correlation does not always mean that a feature is useless.
  • So you should keep the good feature in the set and instead remove the features that are useless. The goal is to still end up with a high score. That way you make sure you do not remove good features, because you would notice the score decreasing (see the sketch after this list).
  • You should train a good model at least once in a while in order to know which features are actually helpful.
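As a rough illustration, here is a minimal backward-elimination sketch of this idea (the synthetic data, model settings, scoring metric and tolerance are placeholder assumptions, not a prescription):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# stand-in data; replace with your own features and target
X, y = make_classification(n_samples=3000, n_features=35, random_state=0)

def cv_score(X_subset, y):
    """CV score of a reasonably strong model on a subset of the features."""
    model = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.2)
    return cross_val_score(model, X_subset, y, cv=5, scoring="roc_auc").mean()

kept = list(range(X.shape[1]))
baseline = cv_score(X[:, kept], y)

for feature in list(kept):
    candidate = [f for f in kept if f != feature]
    score = cv_score(X[:, candidate], y)
    if score >= baseline - 1e-3:  # dropping the feature does not hurt, so it is useless
        kept, baseline = candidate, score

print("kept features:", kept)
```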

For the hyperparameter optimization:

  • You should fix all variables except n_estimators at the start and (roughly) optimize that one parameter on a more fine-grained grid (from 10 to 500 in steps of 20, for example).
  • My general suspicion: way too many estimators, too low a learning_rate (at this stage), and trees that are too shallow (set the depth to 6). Maybe try the following:

    • eta = 0.2
    • n_estimators = [50...400]
    • subsample = [0.8]
    • depth = 6

    and leave the rest as is. Of course, these values depend strongly on the data; a minimal sketch of such a starting configuration follows below.
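A minimal sketch of that starting point (synthetic stand-in data; the scoring metric and the n_estimators grid are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# stand-in data; replace with your own
X, y = make_classification(n_samples=30000, n_features=35, random_state=0)

# fix eta, subsample and depth; only search n_estimators for now
model = XGBClassifier(learning_rate=0.2, subsample=0.8, max_depth=6)

search = GridSearchCV(
    model,
    param_grid={"n_estimators": list(range(50, 401, 50))},
    cv=10,
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_)
```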

A nice guide for XGBoost hyperparameter optimization can be found here.

So I'd suggest redoing the feature selection while keeping the good features in the set, and optimizing a good XGBoost configuration every so often to guide it. Also, do not forget to set aside a small holdout set that you do not use during the feature selection; it can be used at the end to estimate the real performance.
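A minimal sketch of the holdout idea (the split size, metric and final model settings are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# stand-in data; replace with your own
X, y = make_classification(n_samples=30000, n_features=35, random_state=0)

# keep ~10% aside; do feature selection and tuning only on the dev part
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# final model (illustrative settings) trained on the dev part only
model = XGBClassifier(learning_rate=0.2, subsample=0.8, max_depth=6, n_estimators=300)
model.fit(X_dev, y_dev)

# one-time estimate of the real performance on the untouched holdout set
print(roc_auc_score(y_holdout, model.predict_proba(X_holdout)[:, 1]))
```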