Solved – ROC as feature selection

ensemble learning, feature selection, machine learning, regularization, roc

At my new job, it is apparently common practice to use ROC as a feature selection method. They test the variables one by one against the response, and anything with ROC<=54 is tossed aside.

Would you say that this is a good practice? I'm quite skeptical, as I would rather have used ensemble learning to incorporate feature selection, or a regularization method like lasso/elastic net.

Best Answer

Univariate feature selection is generally a poor method.

This question is deftly answered by silverfish in the context of correlation, but all his arguments apply to your case as well. In short, there is no reason to believe that univariately checking how each individual variable $x$ is related to your response $y$ reveals anything about the multivariate nature of the relationship between $X$ and $y$. It's quite possible that you end up screening out many of your good predictors.
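To make that concrete, here is a minimal simulated sketch, not your team's actual procedure. Everything in it is made up for illustration: the data-generating process, the names `signal`, `shared`, `x1`, `x2`, and the reading of "ROC<=54" as an AUC threshold of 0.54. Each feature on its own is almost pure noise with respect to the response, yet a linear combination of the two recovers it almost perfectly.

```python
import numpy as np

def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) formula; assumes no tied scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical data: both features are dominated by the same nuisance term.
rng = np.random.default_rng(0)
n = 20_000
signal = rng.normal(size=n)   # what actually drives the response
shared = rng.normal(size=n)   # nuisance variation common to both features
x1 = shared + 0.03 * signal   # tiny bit of signal buried in noise
x2 = shared                   # no signal at all on its own
y = signal > 0

THRESHOLD = 0.54  # assumption: "ROC<=54" read as AUC <= 0.54

# Univariate screening: one feature at a time against the response.
auc_x1 = auc(x1, y)
auc_x2 = auc(x2, y)

# Joint view: least-squares fit of y on both features (in-sample, for brevity).
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
auc_joint = auc(X @ beta, y)

for name, value in [("x1 alone", auc_x1), ("x2 alone", auc_x2),
                    ("x1 and x2 jointly", auc_joint)]:
    verdict = "kept" if value > THRESHOLD else "tossed"
    print(f"{name}: AUC = {value:.3f} ({verdict})")
```

The joint fit works because `x2` lets the model subtract the nuisance term out of `x1`. A one-at-a-time screen never gets the chance to see that.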

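As you point out, penalized regression methods such as the LASSO, ridge, or elastic net (for example, as implemented in glmnet) are much preferred in a multiple regression model. Strictly speaking, ridge only shrinks coefficients and never sets them exactly to zero, so it regularizes rather than selects. These methods: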

  • Take a fully multivariate view of your predictor / response relationship.
  • Avoid making high variance, binary decisions like "this variable is completely in, this variable is completely out".
  • Lend themselves naturally to cross-validation and other model validation techniques.
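As a rough sketch of what the penalized alternative can look like in practice, the snippet below fits an L1-penalized logistic regression with scikit-learn and picks the penalty strength by cross-validated AUC. The synthetic dataset from `make_classification`, the 20 features, and the grid of 10 penalty values are assumptions for illustration only. In R, glmnet plays the same role.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 20 candidate features, only some of them informative.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, n_redundant=2,
    random_state=0,
)

# Standardize first so the penalty treats all features on the same scale,
# then choose the penalty strength C by 5-fold cross-validated AUC.
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        Cs=10, cv=5, penalty="l1", solver="liblinear",
        scoring="roc_auc", random_state=0,
    ),
)
model.fit(X, y)

lasso = model[-1]                      # the fitted LogisticRegressionCV step
coef = lasso.coef_.ravel()             # one coefficient per feature
best_C = float(np.ravel(lasso.C_)[0])  # selected inverse penalty strength

print("selected C:", best_C)
print("features with nonzero coefficients:", np.flatnonzero(coef))
```

All features are considered jointly, and which coefficients end up at zero depends on a penalty strength chosen by cross-validation rather than on a fixed per-variable cutoff.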

You should carefully and respectfully start pointing your team towards a more modern and disciplined approach.

You also don't mention whether your team is testing for non-linear relationships when fitting these univariate models. At the very least, these univariate models should be based on some basis expansion of the feature, like cubic splines. If they are only testing for univariate linear relationships, that is a further problem.
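A small simulated illustration of the non-linearity point, under made-up assumptions: a single feature `x` with a U-shaped effect on the response, a truncated-power cubic spline basis with three knots at the training quartiles, and a plain least-squares fit standing in for whatever univariate model is used. The helper `cubic_spline_basis` is hypothetical, written here just for the example.

```python
import numpy as np

def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) formula; assumes no tied scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def cubic_spline_basis(x, knots):
    """Truncated-power cubic spline basis: 1, x, x^2, x^3, (x - k)^3_+ per knot."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

# Hypothetical U-shaped effect: the event is more likely when |x| is large.
rng = np.random.default_rng(1)
n = 4000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-2 * (x**2 - 1)))
y = rng.random(n) < p

# Simple train/test split so the spline fit is scored on held-out data.
train = np.arange(n) < n // 2
test = ~train

knots = np.quantile(x[train], [0.25, 0.5, 0.75])
B_train = cubic_spline_basis(x[train], knots)
B_test = cubic_spline_basis(x[test], knots)
beta, *_ = np.linalg.lstsq(B_train, y[train].astype(float), rcond=None)

auc_raw = auc(x[test], y[test])              # feature used as-is
auc_spline = auc(B_test @ beta, y[test])     # feature after spline expansion

print(f"raw feature AUC:    {auc_raw:.3f}")
print(f"spline-based AUC:   {auc_spline:.3f}")
```

Scored as-is, the feature looks useless because the effect is symmetric around zero. The same feature passed through a spline basis is clearly predictive.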
