Solved – How to compute the correlation between two features and their output class

correlation, dimensionality reduction, feature selection

I just read these two quotes in a paper and wondered how I do this.

For each classifier, one selects a subset of the input features
according to their correlation to the corresponding class.

and

Compute the correlation between each feature and Yi the output for class i.

Source: Dimensionality Reduction Through Classifier Ensembles

Can you tell me how to accomplish that?

Best Answer

Correlation is commonly used for feature selection, where it is usually calculated between each feature and the output class (a filter method for feature selection). Roughly, it measures how consistently a change in the feature is reflected in the output class. If the two move together strongly, meaning the correlation is large in absolute value, we say the feature is highly correlated with the output class, and it is usually a good idea to keep it for downstream tasks.
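To make "correlation between a feature and the output for class i" concrete, here is one common reading, assuming the Pearson correlation coefficient and a 0/1 indicator for class membership. The notation is my own assumption (x_{nj} is feature j of sample n, Y_{ni} is 1 if sample n belongs to class i and 0 otherwise), and the paper may use a different measure:

```latex
r_{ij} =
\frac{\sum_{n=1}^{N} (x_{nj} - \bar{x}_j)\,(Y_{ni} - \bar{Y}_i)}
     {\sqrt{\sum_{n=1}^{N} (x_{nj} - \bar{x}_j)^2}\;
      \sqrt{\sum_{n=1}^{N} (Y_{ni} - \bar{Y}_i)^2}}
```

Here \bar{x}_j and \bar{Y}_i are the sample means over the N samples. Values of r_{ij} near +1 or -1 indicate a strong linear relationship between feature j and membership in class i; values near 0 indicate little linear relationship.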

That said, a feature that is highly correlated with the class in one corpus, say XYZ, will not necessarily be highly correlated in another corpus, say ABC, so a correlation-based selection cannot simply be carried over from one dataset to another.

There are many metrics for measuring the association between a feature and a class label, such as mutual information, chi-square, and correlation coefficient scores.

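In Python, numpy.corrcoef computes the Pearson correlation coefficient. Note that numpy.correlate is a different function: it computes the cross-correlation of two 1-D sequences (a signal-processing operation) and does not return a normalized correlation coefficient. If I want to use mutual information instead, sklearn has mutual_info_score.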
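As a rough sketch of the per-class computation described in the question, the following uses NumPy to correlate every feature with a one-vs-rest indicator for each class. The function names `class_feature_correlations` and `top_k_features`, the toy data, and the choice of Pearson correlation are all assumptions made for illustration, not the method from the paper:

```python
import numpy as np

def class_feature_correlations(X, y):
    """Pearson correlation between each feature and each class indicator.

    Returns (classes, corr), where corr[i, j] is the correlation between
    the 0/1 indicator Y_i = (y == classes[i]) and feature column j.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # Y[:, i] is 1.0 where the sample belongs to classes[i], else 0.0
    Y = (y[:, None] == classes[None, :]).astype(float)
    # Center both matrices, then normalize the cross-products
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    cov = Yc.T @ Xc
    norm = np.outer(np.linalg.norm(Yc, axis=0), np.linalg.norm(Xc, axis=0))
    # Constant columns have zero norm; report 0 for them instead of NaN
    corr = np.divide(cov, norm, out=np.zeros_like(cov), where=norm > 0)
    return classes, corr

def top_k_features(corr_row, k):
    """Indices of the k features with the largest absolute correlation."""
    return np.argsort(-np.abs(corr_row))[:k]

# Toy data, made up for illustration: 6 samples, 3 features, 2 classes.
X = np.array([[1.0, 5.0, 0.3],
              [2.0, 4.0, 0.1],
              [3.0, 6.0, 0.2],
              [7.0, 5.0, 0.3],
              [8.0, 4.0, 0.1],
              [9.0, 6.0, 0.2]])
y = np.array([0, 0, 0, 1, 1, 1])

classes, corr = class_feature_correlations(X, y)
for i, c in enumerate(classes):
    print("class", c, "correlations", np.round(corr[i], 3),
          "top feature", top_k_features(corr[i], k=1))
```

Each row `corr[i]` holds the correlations for class `classes[i]`, so taking the top k entries by absolute value gives one feature subset per class, along the lines of the quoted passage. Ranking by absolute value is a design choice: a strongly negative correlation is treated as just as informative as a strongly positive one.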
