Boruta 'all-relevant' feature selection vs Random Forest 'variables of importance'

Tags: boruta, feature selection, importance, random forest

Can someone explain the difference between variables of importance from random forest vs all-relevant features from Boruta feature selection?

For example, if one were to build a model (it could be any model) using a subset of 'important' or 'relevant' features, would it be better to use the output of Boruta all-relevant feature selection or the Random Forest 'variables of importance'?
Is one method preferred over the other? If so why?

Best Answer

Boruta and random forest differences

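Random forest variable importance gives a ranking of the features but no built-in threshold for deciding which of them are relevant. The Boruta algorithm adds randomization on top of that importance: it appends shuffled copies of the features ("shadow" features) to the data, fits a random forest, and over repeated runs tests whether each real feature's importance is significantly higher than that of the best shadow feature. Features that pass are confirmed as relevant and the rest are rejected. For details of the difference, please refer to Section 2 of the article: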

Kursa, Miron B., and Witold R. Rudnicki. "Feature selection with the Boruta package." (2010).
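
To make the randomization step concrete, here is a minimal sketch of the shadow-feature comparison in plain Python. It is not the Boruta package: to stay dependency-free it swaps random forest importance for absolute Pearson correlation with the target, and it only counts "hits" (rounds in which a feature beats the best shuffled copy) instead of running Boruta's statistical test. The names `abs_corr`, `shadow_hits` and the toy columns are made up for illustration.

```python
import random
from math import sqrt


def abs_corr(x, y):
    """Absolute Pearson correlation; a stand-in for random forest importance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / sqrt(sxx * syy)


def shadow_hits(columns, target, n_rounds=50, seed=0):
    """Count how often each feature scores above the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_rounds):
        # Shadow features: shuffled copies that keep each column's
        # distribution but break any relationship with the target.
        shadow_scores = []
        for values in columns.values():
            shuffled = list(values)
            rng.shuffle(shuffled)
            shadow_scores.append(abs_corr(shuffled, target))
        threshold = max(shadow_scores)
        # A real feature scores a hit when it beats the best shadow.
        for name, values in columns.items():
            if abs_corr(values, target) > threshold:
                hits[name] += 1
    return hits


# Toy data (hypothetical): two informative columns and two pure-noise columns.
data_rng = random.Random(42)
n = 200
signal_a = [data_rng.gauss(0, 1) for _ in range(n)]
signal_b = [data_rng.gauss(0, 1) for _ in range(n)]
noise_a = [data_rng.gauss(0, 1) for _ in range(n)]
noise_b = [data_rng.gauss(0, 1) for _ in range(n)]
target = [2 * a + b + data_rng.gauss(0, 0.5) for a, b in zip(signal_a, signal_b)]

columns = {
    "signal_a": signal_a,
    "signal_b": signal_b,
    "noise_a": noise_a,
    "noise_b": noise_b,
}
hits = shadow_hits(columns, target, n_rounds=50)
print(hits)
```

The informative columns should beat the best shadow in nearly every round, while the noise columns do so only occasionally. A plain importance ranking would still list all four columns and leave the cutoff to the user. The shadow comparison is what supplies a data-driven reference level.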

Is one method preferred over the other? If so why?

This is a classic case of the "No Free Lunch" theorem: without knowing the data and the assumptions, it is impossible to say which one is better. Note, however, that Boruta was designed as an improvement over raw random forest variable importance, so it should perform better more often than not (I am biased here, because I like randomization techniques myself). Nevertheless, depending on the data and the available computation time, plain random forest variable importance can be the better choice, since Boruta has to fit the forest many times.
