Solved – Feature selection: why does Boruta confirm all features as "important"?

Tags: feature-selection, machine-learning, overfitting

I want to run a feature selection study to keep only the most important features before running a machine learning classification. My data is 30,000 x 17 (observed objects x features). I use the R implementation of Boruta with default parameters.
My result: all 17 features come out green (confirmed as "important"). This is suspicious, because it is likely that some of them are uninformative and should be dropped.
When I use only a subset of observations (e.g. 100 randomly chosen out of the 30,000), the Boruta output changes drastically: 6 features are red (unimportant) and 11 are green (important).
Why do I get such different results? Is it overfitting? How should I proceed to reliably identify the least and most relevant features among the initial set of 17?
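One way to probe this instability is to re-run Boruta on several random subsamples and see how often each feature is confirmed. The sketch below assumes a data frame `df` with a factor response column `class` (both names are placeholders for your data); the subsample size and number of runs are arbitrary choices.

```r
# Sketch: assess the stability of Boruta's decisions across subsamples.
# `df` and `class` are hypothetical names standing in for your data.
library(Boruta)

set.seed(42)
n_runs <- 10
n_sub  <- 5000   # subsample size; adjust for your data and runtime budget

decisions <- sapply(seq_len(n_runs), function(i) {
  idx <- sample(nrow(df), n_sub)
  b   <- Boruta(class ~ ., data = df[idx, ])
  b   <- TentativeRoughFix(b)   # resolve any leftover "Tentative" calls
  setNames(as.character(b$finalDecision), names(b$finalDecision))
})

# Fraction of runs in which each feature was confirmed as important
rowMeans(decisions == "Confirmed")
```

Features confirmed in nearly every run are good candidates to keep; features whose status flips between runs are exactly the weakly relevant ones Boruta is designed to surface.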

Best Answer

I have had a similar experience with real-life data. Boruta does not give you any guarantees; you should treat its output as a "suggestion" rather than a definitive answer.

This was even discussed by Kursa and Rudnicki (2010) in their paper about Boruta:

One should note that the Boruta is a heuristic procedure designed to find all relevant attributes, including weakly relevant attributes. Following Nilsson et al. (2007), we say that attribute is weakly important when one can find a subset of attributes among which this attribute is not redundant. The heuristic used in Boruta implies that the attributes which are significantly correlated with the decision variables are relevant, and the significance here means that correlation is higher than that of the randomly generated attributes.

You could also try other methods, e.g. entropy-based ones (see the FSelectorRcpp package).
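For comparison, an entropy-based ranking is a one-liner with FSelectorRcpp. This sketch again assumes a data frame `df` with a factor response `class`:

```r
# Sketch: entropy-based feature ranking as a cross-check on Boruta.
# `df` and `class` are placeholder names for your own data.
library(FSelectorRcpp)

ig <- information_gain(class ~ ., data = df)
ig[order(-ig$importance), ]   # features sorted by information gain
```

Unlike Boruta, this gives a continuous ranking rather than a confirmed/rejected decision, so you still have to choose a cutoff yourself.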

Feature selection algorithms are far from perfect. Marcin Kosiński compared the performance of three different methods and got a different solution from each.

[Image: comparison of the feature sets selected by the three methods]

(source: r-addict.com)


Kursa, M.B., & Rudnicki, W.R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11), 1-12.