Solved – the typical size of a feature matrix for XGBoost

boosting, feature-selection, subsampling

I have a binary classification problem with a million samples and around 1000 features. I am trying to understand whether I should subsample the dataset and add a feature selection step (and, if so, approximately how many features I should retain).
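If subsampling turns out to be useful for your problem, one simple option is a stratified row subsample, so the class balance of the smaller dataset matches the full data. A minimal standard-library sketch (the `stratified_subsample` helper below is hypothetical, for illustration only):

```python
import random
from collections import defaultdict

def stratified_subsample(X, y, fraction, seed=0):
    """Randomly keep `fraction` of the rows within each class,
    so the subsample preserves the class balance of the full data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    keep = []
    for label, idx in by_class.items():
        k = max(1, int(len(idx) * fraction))  # at least one row per class
        keep.extend(rng.sample(idx, k))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy imbalanced binary data: 90% class 0, 10% class 1.
X = [[float(i)] for i in range(1000)]
y = [0] * 900 + [1] * 100
X_sub, y_sub = stratified_subsample(X, y, fraction=0.1)
print(len(X_sub), sum(y_sub))  # 100 rows kept, 10 of them class 1
```

In practice you would apply the same idea via your ML library's utilities (e.g. a stratified train/test split) rather than hand-rolling it, but the principle is the same: subsample rows, not the minority class.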

Best Answer

The typical number of features is the number of features that you think are relevant. No more, no less.

It might be possible to tabulate the number of features used in every XGBoost model ever fit and compute descriptive statistics of that data, but this would be folly -- there's also an average telephone number, but no one's suggesting that you should call it.

People derive features and estimate models to solve particular problems. What does oncology have to do with the price of tea?
