Solved – R gbm package variable influence

boosting, feature-selection, r

I'm using the excellent gbm package in R to do multinomial classification, and my question is about feature selection.

After deciding the number of iterations using cross validation (for a given shrinkage and interaction.depth), do I need to re-run the model using only the 'important' features, or will it automatically do this feature selection for me?

In other words, after the initial fit, do I need to exclude the 'irrelevant' variables and re-fit the model?
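For concreteness, here is a rough sketch of my current workflow (iris is just a stand-in for my data, and the shrinkage/depth values are placeholders):

    library(gbm)

    set.seed(1)
    # multinomial fit with built-in cross-validation
    fit <- gbm(Species ~ ., data = iris,
               distribution = "multinomial",
               n.trees = 2000, shrinkage = 0.01, interaction.depth = 3,
               cv.folds = 5)

    # number of iterations chosen by cross-validation
    best_iter <- gbm.perf(fit, method = "cv")

    # relative influence of each predictor at that iteration count
    summary(fit, n.trees = best_iter)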

Thanks!

EDIT: This question is more about the way the package and algorithm implementation works.

Best Answer

  • Would recommend the review article on gbm co-authored by Hastie; it is a good fundamental review of boosting and I would trust it more than my own opinions: "A working guide to boosted regression trees", Journal of Animal Ecology, 2008. http://avesbiodiv.mncn.csic.es/estadistica/bt1.pdf

  • Variable selection is one of the strongest appeals of machine learning algorithms versus traditional likelihood-based models. Because of the built-in regularization, a priori pre-specified models are not critical to model performance. I'm not sure I agree with the comment above regarding irrelevant variables.

  • One of the much-quoted strengths of machine learning algorithms is that they can potentially utilize a large number of weakly important variables and thereby improve prediction. If you eliminate a large number of predictors that each individually have limited incremental utility, you can cumulatively have a negative impact on your model.

  • Usually the inclusion of irrelevant variables is not thought to negatively impact model prediction; see Figure 15.7 in ESL. http://web.stanford.edu/~hastie/ElemStatLearnII/figures15.pdf

  • You thus don't have to remove variables from your model unless there are other reasons why you might want to (ease of implementation, etc.); see the sketch after this list for one way to check which variables gbm actually used.
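As an illustration (reusing the fit and best_iter objects from the sketch in the question), you can simply inspect the relative influence table and, if you really do want a lighter model, drop only the variables that were never used:

    # relative influence at the CV-chosen number of trees; no re-fit is needed,
    # boosting already spends few or no splits on uninformative predictors
    rel_inf <- summary(fit, n.trees = best_iter, plotit = FALSE)
    rel_inf

    # variables with exactly zero relative influence were never split on;
    # these are the only ones you could drop purely for ease of implementation
    rel_inf$var[rel_inf$rel.inf == 0]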

Edit - Would also add that most variable selection methods implicitly seek to find all relevant features rather than a parsimonious feature set. Often the distinction isn't made as clearly as it could be. I haven't used the VSURF package, but I like the fact that it explicitly differentiates these two objectives. Using importance measures is likely to give you an all-relevant feature set rather than a sufficient parsimonious set, but that may be good enough.
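For reference, a minimal sketch of that distinction with VSURF (I'm assuming its default VSURF(x, y) interface, again with iris as a stand-in; it can be slow on real data):

    library(VSURF)

    set.seed(1)
    x <- iris[, 1:4]
    # three-step selection: thresholding ("all relevant"),
    # interpretation, and prediction (most parsimonious)
    vs <- VSURF(x = x, y = iris$Species)

    names(x)[vs$varselect.thres]   # all-relevant set
    names(x)[vs$varselect.pred]    # parsimonious, prediction-oriented set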