Solved – Heuristic Feature Selection for Gradient Boosting

boosting, machine learning, python, scikit-learn

I am trying to choose between two different sets of features for a Gradient Boosting Machine, but I do not want to train a full model on each set. Could I differentiate their performance with a lower number of trees?

Suppose that, given the other parameters, I ultimately need about 1000 trees for the best fit. If I just want to see whether one set of features is likely to perform better than another, can I trim the number of trees to 50 and then validate? Or even 5? Does the implementation work in a way that the most useful trees are built early on, so that performance with a lower number of trees would be indicative of ultimate performance, or would there be problems at validation? I am using scikit-learn, and I am a bit new to it, so I just wanted to be sure about how it works.
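Here is a rough sketch of the comparison I have in mind; the synthetic data and the column indices for the two candidate feature sets are just placeholders for my real data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data; in practice X, y and the column lists come from my dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
feature_set_a = [0, 1, 2, 3, 4]   # hypothetical first candidate set
feature_set_b = [5, 6, 7, 8, 9]   # hypothetical second candidate set

for name, cols in [("A", feature_set_a), ("B", feature_set_b)]:
    # Deliberately small n_estimators -- the question is whether this cheap
    # fit ranks the feature sets the same way a 1000-tree fit would.
    gbm = GradientBoostingClassifier(n_estimators=50, random_state=0)
    scores = cross_val_score(gbm, X[:, cols], y, cv=5, scoring="roc_auc")
    print(f"feature set {name}: mean CV AUC = {scores.mean():.3f}")
```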

In short, are early tree fits somewhat indicative of feature importance?

Best Answer

To answer the direct question: no, a tree's position in the ensemble is NOT indicative of feature importance. Gradient boosting builds trees sequentially, each one fitting the residual errors of the ensemble so far, and a Random Forest chooses a random subset of candidate features at each split; in neither case does the order in which a feature first gets used tell you how important it is. The feature importances scikit-learn reports are aggregated over all trees in the fitted ensemble, not derived from the early trees alone.
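If it helps on the practical side, one way to see how the fit evolves with the number of trees without retraining is scikit-learn's staged prediction interface, which scores the ensemble after each boosting stage from a single full fit. The data and hyperparameters below are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for your real feature set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=1000, random_state=0)
gbm.fit(X_tr, y_tr)

# staged_predict_proba yields predictions after 1, 2, ..., n_estimators trees,
# so one fit shows how the validation score changes as trees are added.
for n_trees, proba in enumerate(gbm.staged_predict_proba(X_val), start=1):
    if n_trees in (5, 50, 200, 1000):
        auc = roc_auc_score(y_val, proba[:, 1])
        print(f"{n_trees:4d} trees: validation AUC = {auc:.3f}")

# feature_importances_ aggregates impurity reduction over *all* trees,
# not just the early ones.
print(gbm.feature_importances_)
```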
