Solved – Generalising correlation-based feature selection

classification, data mining, feature selection, predictive-models, regression

One method of feature selection is to calculate Pearson's correlation coefficient between each explanatory variable $X_i$ and the response variable $Y$. The absolute values of the coefficients are then sorted in decreasing order and the first $K$ variables are selected.
As far as I know, Pearson's correlation coefficient can be calculated only between two quantitative variables. So:

  1. How can I perform feature selection if the dataset includes both qualitative and quantitative explanatory variables?
  2. In a classification problem with 0-1 response, is the above method still correct? (In other words, does it make sense to calculate Pearson's correlation coefficient between a binary variable and a quantitative variable?)
  3. What about a multiclass classification problem?
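To make question 2 concrete: Pearson's $r$ computed between a 0-1 variable and a quantitative variable is exactly the point-biserial correlation, so the two coincide numerically. A small sketch with simulated data (variable names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)        # binary 0-1 response
x = y * 1.5 + rng.normal(size=200)      # quantitative variable related to y

r_pearson, _ = stats.pearsonr(x, y)
r_pb, _ = stats.pointbiserialr(y, x)    # point-biserial correlation
print(np.isclose(r_pearson, r_pb))      # the two coefficients are identical
```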

Best Answer

With respect to your question #1, if one goes down the path of using pairwise measures of association as a first step in variable selection, then clearly Pearson and Spearman correlations aren't appropriate for categorical variables. The goal becomes identifying a standardized metric that works for both continuous and categorical features. One possibility is to use the magnitude of the F statistic from a one-way ANOVA.
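A minimal sketch of that idea, using `scipy.stats.f_oneway` (the data and feature names are made up for illustration): each quantitative feature is scored by its one-way ANOVA F statistic across the classes of the response, and features are ranked by that score.

```python
import numpy as np
from scipy import stats

def f_score(feature, labels):
    """One-way ANOVA F statistic of a quantitative feature across classes."""
    groups = [feature[labels == c] for c in np.unique(labels)]
    f, _ = stats.f_oneway(*groups)
    return f

rng = np.random.default_rng(1)
y = rng.integers(0, 3, size=300)                    # 3-class response
informative = y + rng.normal(scale=0.5, size=300)   # related to y
noise = rng.normal(size=300)                        # unrelated to y

scores = {"informative": f_score(informative, y), "noise": f_score(noise, y)}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # the informative feature ranks first
```

This handles the multiclass case of question #3 as well, since the F statistic is defined for any number of groups.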

However, and as Tim notes in his link, there are many reasons not to use pairwise measures of dependence as a first step in variable selection for multiple regression. These stem largely from the conditional nature of model building -- which pairwise comparisons cannot capture -- and from the loss of potentially important moderator or interaction effects when a rigid cutoff is applied for variable inclusion.

Breiman recommended leveraging the "meta-" output from his random forest routine as input to variable selection. Basically, this involves tracking a feature's performance across the alternative data and feature landscapes into which it is selected -- a kind of Darwinian approach to variable selection that gets around the inability of pairwise metrics to see linear combinations. In addition, and depending on how you work it up, it can evaluate moderator or interaction effects...up to the limits of your machine in enumerating 2-way or higher interactions (the possibilities quickly become enormous, even with a relative handful of features). If you want to get really picky, you can substitute logistic regression as the RF model-building framework to deal with the question #2 issue you raise.
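A minimal sketch of the RF route, using scikit-learn's impurity-based `feature_importances_` as a stand-in for Breiman's "meta-" output (the dataset is synthetic and the choice of $K$ is arbitrary):

```python
# Rank features by random-forest importance and keep the top K.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

K = 5
top_k = rf.feature_importances_.argsort()[::-1][:K]  # indices of top-K features
print(sorted(top_k))
```

Note that impurity importances are only one of several importance measures; permutation importance is a common alternative when features differ in cardinality or scale.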

That said, one of the challenges with variable selection is identifying a "best" methodology. The fact is that every statistician and their brother has an algorithm, with a working paper, that they use and advocate -- or, alternatively, algorithms that they disparage. There is wide agreement that stepwise approaches -- which select variables on significant p-values -- do not deliver "good" answers. There is also wide agreement that Tibshirani's Lasso does provide reasonably good answers; e.g., the Normal Deviate blog described it as one of the top statistical contributions of the last 10 or 20 years. So, if you're looking for an approach to variable selection that easily handles both categorical (including multiclass) and continuous variables and that has wide support in the statistical community, then the Lasso is a safe bet.
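A minimal sketch of the Lasso on mixed feature types, assuming scikit-learn: categorical columns are one-hot encoded, continuous ones standardized, and the L1 penalty then drives coefficients of weak features toward zero. The column names and data are invented for illustration; for a 0-1 or multiclass response you would swap `LassoCV` for `LogisticRegression(penalty="l1")`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "x1": rng.normal(size=300),                    # informative continuous
    "x2": rng.normal(size=300),                    # pure noise
    "cat": rng.choice(["a", "b", "c"], size=300),  # categorical
})
y = 2.0 * df["x1"] + (df["cat"] == "a") + rng.normal(scale=0.1, size=300)

pre = ColumnTransformer([
    ("num", StandardScaler(), ["x1", "x2"]),
    ("cat", OneHotEncoder(), ["cat"]),
])
model = make_pipeline(pre, LassoCV(cv=5)).fit(df, y)
coefs = model.named_steps["lassocv"].coef_
print(np.round(coefs, 2))  # coefficient on the noise column is driven to ~0
```

Features whose coefficients are shrunk to (near) zero are the ones dropped, so selection falls out of the fit itself rather than from a pre-screening step.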

My concern has always been with the scalability of any of these statistical solutions; e.g., the Lasso breaks down when faced with a massive number of features or predictors (this isn't a concern of the OP's, but I'll discuss it anyway). By "massive" I mean candidate features numbering in the thousands, the hundreds of thousands, or, for some applications, even the millions. RFs are scalable as a "divide and conquer" method if you have a massively parallel system and can run millions of "mini-models" in a finite amount of time.

What are some other similarly scalable solutions? This is an open-ended, rhetorical query.