Solved – Does feature selection help improve the performance of machine learning

boosting, feature selection

Does feature selection help improve the performance of machine learning?

For example, suppose I have an n >> p data set: does it help to select important variables before fitting an XGBoost model?

Best Answer

You should exclude variables that you are confident have no plausible influence on the dependent variable; ideally you start from a pool of variables for which you have some hypothesis about how they affect the outcome. You do not want the model to learn noise from variables that make no logical sense in the independent-variable space but happen to be spuriously correlated with other variables. Beyond those obvious exclusions, though, how would you know in advance which features are important and which are not? A variable you expect to matter little may turn out, once you actually fit the model, to have far more discriminatory power than you thought.

Tree-based ensemble methods such as XGBoost evaluate every variable as a candidate splitting variable, which makes them fairly robust to unimportant or irrelevant features: a variable that cannot discriminate between events and non-events will rarely be chosen as a split, and it will sit near the bottom of the variable-importance chart. One caveat is that if you have two or more highly correlated variables, the importance reported for each of them may not reflect its actual contribution (although even this does not hurt the model's predictive performance). A practical approach is therefore to leave all features in, run a few iterations, and drop the variables that consistently sit at the bottom of the importance chart from subsequent runs to improve computational performance.
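As a rough illustration of that workflow, here is a minimal sketch (assuming the Python xgboost and scikit-learn packages): fit XGBoost with all features, average the built-in importances over a few seeds, and drop the features that consistently score near zero. The synthetic data set and the 0.01 cutoff are purely hypothetical placeholders, not a recommendation for your data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Hypothetical n >> p data: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=10_000, n_features=20,
                           n_informative=5, n_redundant=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Average the built-in feature importances over a few seeds to smooth out noise.
importances = []
for seed in range(5):
    model = XGBClassifier(n_estimators=200, max_depth=4, random_state=seed)
    model.fit(X_train, y_train)
    importances.append(model.feature_importances_)
mean_imp = np.mean(importances, axis=0)

# Keep everything except the features that consistently score near zero
# (0.01 is an arbitrary cutoff for illustration only).
keep = mean_imp > 0.01
print("Dropping features:", np.where(~keep)[0])

model_reduced = XGBClassifier(n_estimators=200, max_depth=4, random_state=0)
model_reduced.fit(X_train[:, keep], y_train)

# Compare held-out accuracy of the last full-feature model vs. the reduced one.
print("Full-feature accuracy:   ", model.score(X_test, y_test))
print("Reduced-feature accuracy:", model_reduced.score(X_test[:, keep], y_test))
```

On data like this you would typically expect the two accuracies to be very close, with the reduced model training faster, which is the point of the answer: the gain is mostly computational, not predictive.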
