Solved – Feature selection by univariate correlation with class labels

classification, correlation

I was reading page 245 of Hastie et al. (The Elements of Statistical Learning) about how not to do feature selection. Basically, they describe what happens when the test and training sets are not independent because the entire data set was used for feature selection. They perform the following experiment to illustrate their point:

"Consider a scenario with N = 50 samples in two equal-sized classes,
and p = 5000 quantitative predictors (standard Gaussian) that are
independent of the class labels. The true (test) error rate of any
classifier is 50%. We carried out the above recipe, choosing in step
(1) the 100 predictors having highest correlation with the class
labels, and then using a 1-nearest neighbor classifier, based on just
these 100 predictors, in step (2). Over 50 simulations from this
setting, the average CV error rate was 3%. This is far lower than the
true error rate of 50%."
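
To make the quoted setup concrete, here is a minimal sketch of one such simulation in NumPy. It is not the authors' code: the seed, the use of 5 folds, ranking by absolute correlation, and the helper names `top_k_by_correlation` and `one_nn_predict` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
N, p, k, n_folds = 50, 5000, 100, 5  # n_folds is an assumption, not from the quote

X = rng.standard_normal((N, p))   # predictors, independent of the labels
y = np.repeat([0, 1], N // 2)     # two equal-sized classes coded 0/1


def top_k_by_correlation(X, y, k):
    """Indices of the k columns of X with the largest |Pearson r| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(r))[:k]


def one_nn_predict(X_train, y_train, X_test):
    """Label of the nearest training point (squared Euclidean distance)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]


# Step (1), done the wrong way: select features using ALL samples
selected = top_k_by_correlation(X, y, k)

# Step (2): cross-validate 1-NN using only the preselected features
folds = np.array_split(rng.permutation(N), n_folds)
errors = 0
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(N), test_idx)
    pred = one_nn_predict(X[np.ix_(train_idx, selected)], y[train_idx],
                          X[np.ix_(test_idx, selected)])
    errors += np.sum(pred != y[test_idx])

cv_error = errors / N
print(f"CV error with selection outside the folds: {cv_error:.2f}")
```

Moving the call to `top_k_by_correlation` inside the loop, so that it sees only `train_idx`, would remove the leak that the quote warns about.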

I just have a question about one detail: this is a two-class problem, so I assume the labels are, for instance, either 1 or 0. How is it possible, then, to calculate correlation values between features and labels? I understand that this can be done in the regression context, where the response variable can take on any value, but I am not sure about the classification context.
Thanks

Best Answer

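You do it in exactly the same way as with continuous variables; the only difference is that your dependent variable takes the value 0 or 1. Pearson's correlation in this case is the point-biserial correlation, and testing it against zero is equivalent to a two-sample t-test (with pooled variance) comparing the feature between the two classes.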
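
As a small check of this, the sketch below computes the Pearson correlation between one simulated feature and 0/1 labels, then compares the t statistic derived from it, t = r·sqrt(n − 2)/sqrt(1 − r²), with the pooled-variance two-sample t statistic. The data, seed, and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

n = 50
y = np.repeat([0, 1], n // 2)   # binary class labels coded as 0/1
x = rng.standard_normal(n)      # one feature, independent of y

# Pearson correlation is computed as usual, even though y is binary
r = np.corrcoef(x, y)[0, 1]

# Pooled-variance two-sample t statistic comparing x between the classes
x0, x1 = x[y == 0], x[y == 1]
n0, n1 = len(x0), len(x1)
pooled_var = ((n0 - 1) * x0.var(ddof=1) + (n1 - 1) * x1.var(ddof=1)) / (n - 2)
t_two_sample = (x1.mean() - x0.mean()) / np.sqrt(pooled_var * (1 / n0 + 1 / n1))

# t statistic for testing whether the correlation is zero
t_from_r = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

print(r, t_two_sample, t_from_r)  # the two t values agree up to rounding
```

So ranking features by the absolute value of this correlation gives the same ordering as ranking them by the absolute value of the two-sample t statistic.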
