Feature Selection in Classification – Why Is It Important?

Tags: accuracy, feature-selection, regression-strategies

I'm learning about feature selection. I can see why it would be important and useful for model building, but let's focus on supervised learning (classification) tasks. Why is feature selection important for classification tasks?

I see lots of literature written about feature selection and its use for supervised learning, but this puzzles me. Feature selection is about identifying which features to throw away. Intuitively, throwing away some features seems self-defeating: it discards information, and it seems like discarding information shouldn't help.

And even if removing some features does help, if we are throwing away some features and then feeding the rest into a supervised learning algorithm, why do we need to do that ourselves rather than letting the supervised learning algorithm handle it? If some feature is not helpful, shouldn't any decent supervised learning algorithm implicitly discover that and learn a model that doesn't use that feature?

So intuitively I would have expected that feature selection would be a pointless exercise that never helps and can sometimes hurt. But the fact that it's so widely used and written about makes me suspect that my intuition is faulty. Can anyone provide some intuition for why feature selection is useful and important when doing supervised learning? Why does it improve the performance of machine learning models? Does it depend on which classifier I use?

Best Answer

Your intuition is quite correct. In most situations, feature selection represents a desire for a simple explanation that results from three misunderstandings:

  1. The analyst does not realize that the set of "selected" features is quite unstable, i.e., non-robust, and that repeating the selection on another dataset will result in a quite different set of features (the sketch after this list illustrates this). The data often do not possess the information content needed to select the "right" features. This problem gets worse when collinearities are present.
  2. Pathways, mechanisms, and processes are complex in uncontrolled experiments; human behavior and nature are complex and not parsimonious.
  3. Predictive accuracy is harmed by asking the data to tell you both which features are important and what the relationships with $Y$ are for the "important" ones. It is better to "use a little bit of each variable" than to use all of some variables and none of others, i.e., to use shrinkage/penalization.
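
To make the instability in point 1 concrete, here is a minimal sketch. It uses a simulated dataset and an L1-penalized logistic regression as the "selector" (both of which are my illustrative assumptions, not part of the answer above), refits the selector on bootstrap resamples, and tallies how often each feature ends up selected:

```python
# Sketch: how stable is a "selected" feature set under resampling?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=30, n_informative=8,
                           random_state=0)
X = StandardScaler().fit_transform(X)

n_boot = 100
selected = np.zeros((n_boot, X.shape[1]), dtype=bool)
for b in range(n_boot):
    idx = rng.integers(0, len(y), len(y))            # bootstrap resample
    lasso = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
    lasso.fit(X[idx], y[idx])
    selected[b] = lasso.coef_.ravel() != 0           # which features "survived"

# Selection frequencies far from 0 or 1 mean the feature set is unstable:
# a different sample of the same process would pick a different set.
print(np.round(selected.mean(axis=0), 2))
```
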

Some ways to study this:

  1. Do more comparisons of predictive accuracy between the lasso, elastic net, and a standard quadratic penalty (ridge regression); see the first sketch after this list.
  2. Bootstrap variable importance measures from a random forest and check their stability (second sketch below).
  3. Compute bootstrap confidence intervals on the ranks of potential features, e.g., the ranks of partial $\chi^2$ tests of association (or of simpler measures such as univariate Spearman $\rho$ or Somers' $D_{xy}$), and see that these confidence intervals are extremely wide, directly informing you of the difficulty of the task (third sketch below). My course notes linked from http://biostat.mc.vanderbilt.edu/rms have an example of bootstrapping the rank order of predictors using OLS.
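
For study 1, one way to set up the comparison is with penalized logistic regression in scikit-learn: an L1 (lasso-type) penalty, an elastic-net penalty, and an L2 (ridge-type) penalty tuned by cross-validation on the same data. The simulated dataset, the hyperparameter grid, and the use of log loss as the accuracy measure are my assumptions for the sketch:

```python
# Sketch: cross-validated comparison of lasso-, elastic-net-, and ridge-type penalties.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=40, n_informative=10,
                           random_state=1)

for penalty, extra in [("l1", {}),
                       ("elasticnet", {"l1_ratios": [0.5]}),
                       ("l2", {})]:
    model = make_pipeline(
        StandardScaler(),
        LogisticRegressionCV(Cs=5, penalty=penalty, solver="saga",
                             max_iter=5000, **extra))
    # log loss is a proper scoring rule; higher (less negative) is better
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")
    print(f"{penalty:10s} mean log loss = {-scores.mean():.3f}")
```
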
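
For study 2, a minimal sketch: refit a random forest on bootstrap resamples and look at how much each feature's importance rank moves around. The data and forest settings are again illustrative assumptions:

```python
# Sketch: bootstrap stability of random-forest variable importance ranks.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           random_state=2)

n_boot = 50
ranks = np.empty((n_boot, X.shape[1]), dtype=int)
for b in range(n_boot):
    idx = rng.integers(0, len(y), len(y))
    rf = RandomForestClassifier(n_estimators=200, random_state=b)
    rf.fit(X[idx], y[idx])
    # rank 1 = most important feature in this resample
    ranks[b] = np.argsort(np.argsort(-rf.feature_importances_)) + 1

# A wide spread of ranks across resamples means the "importance" ordering is unstable.
lo, hi = np.percentile(ranks, [2.5, 97.5], axis=0)
for j in range(X.shape[1]):
    print(f"feature {j:2d}: rank 95% interval [{lo[j]:.0f}, {hi[j]:.0f}]")
```
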
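
For study 3, here is a sketch of bootstrap confidence intervals on the ranks of features, using the absolute univariate Spearman $\rho$ with the outcome as a simple stand-in for the partial $\chi^2$ ranking mentioned above (that substitution, and the simulated data, are my assumptions):

```python
# Sketch: bootstrap confidence intervals for the rank of each feature's
# univariate association (|Spearman rho|) with the outcome.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification

rng = np.random.default_rng(3)
X, y = make_classification(n_samples=250, n_features=15, n_informative=5,
                           random_state=3)

def importance_ranks(X, y):
    # rank 1 = largest |Spearman rho| with y
    rho = np.array([abs(spearmanr(X[:, j], y)[0]) for j in range(X.shape[1])])
    return np.argsort(np.argsort(-rho)) + 1

n_boot = 200
ranks = np.empty((n_boot, X.shape[1]), dtype=int)
for b in range(n_boot):
    idx = rng.integers(0, len(y), len(y))
    ranks[b] = importance_ranks(X[idx], y[idx])

# Very wide intervals show the data cannot pin down a feature ordering.
lo, hi = np.percentile(ranks, [2.5, 97.5], axis=0)
for j in range(X.shape[1]):
    print(f"feature {j:2d}: rank 95% CI [{lo[j]:.0f}, {hi[j]:.0f}]")
```
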

All of this applies to both classification and the more general and useful concept of prediction.