Solved – Feature Selection: Correlation and Redundancy

classification · correlation · data preprocessing · time series

Assume we have several numerical, multidimensional time series. As preprocessing for further analysis, I first check all dimensions/features for relevance and then for redundancy.

1) Check for relevance:
I will exclude all dimensions with a variance of 0 over the whole dataset, since such a dimension contains no information that helps to classify/distinguish the time series from each other.
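The relevance check above can be sketched with NumPy; the toy matrix below is made up for illustration (feature 2 is constant):

```python
import numpy as np

# Toy data: 5 samples x 4 features; feature 2 is constant (zero variance).
X = np.array([
    [1.0, 3.0, 7.0, 0.2],
    [2.0, 1.0, 7.0, 0.4],
    [3.0, 4.0, 7.0, 0.6],
    [4.0, 1.0, 7.0, 0.8],
    [5.0, 5.0, 7.0, 1.0],
])

variances = X.var(axis=0)   # per-feature variance over all samples
keep = variances > 0        # boolean mask of informative features
X_relevant = X[:, keep]

print(keep)              # [ True  True False  True]
print(X_relevant.shape)  # (5, 3)
```

scikit-learn's `VarianceThreshold` does the same filtering if you prefer not to roll your own.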

2) Check for redundancy:
I compute the correlation of all dimensions/features with each other, and my intuition says (here is my question) that feature pairs which correlate by exactly -1 or +1 are redundant. A high correlation such as 0.99 may look redundant, but it is not; only a correlation of exactly -1 or +1 means redundancy.
Therefore I will randomly exclude one of the two dimensions/features which correlate by +1 or -1.
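A minimal sketch of this redundancy step, assuming synthetic data where one feature is an exact linear function of another (so their correlation is ±1 up to floating-point tolerance):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = 2.0 * a + 1.0            # exact linear function of a, so r = +1
c = rng.normal(size=100)     # independent feature
X = np.column_stack([a, b, c])

corr = np.corrcoef(X, rowvar=False)  # feature-by-feature correlation matrix

# Drop the later feature of every pair with |r| == 1 (up to float tolerance).
n = X.shape[1]
drop = set()
for i in range(n):
    for j in range(i + 1, n):
        if j not in drop and np.isclose(abs(corr[i, j]), 1.0):
            drop.add(j)

keep = [j for j in range(n) if j not in drop]
X_reduced = X[:, keep]
print(keep)  # [0, 2]
```

Note the `np.isclose` tolerance: even an exact linear relation rarely yields a correlation of literally 1.0 in floating-point arithmetic.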

I am still sceptical whether this assumption is correct. Are there any leads that could prove/disprove my intuition about the connection between redundancy and correlation?

Best Answer

High absolute correlation does not imply redundancy of features in the context of classification.

An example is given in the textbook Feature Extraction: Foundations and Applications by I. Guyon et al. (p. 10, figure 2(e)). I reproduced the example in Python for visualization with matplotlib.

[Figure: two highly correlated features; the two classes form parallel elongated clusters that overlap on either axis alone but are separable in the joint feature space]

In this example both features are highly correlated, yet separation of the classes is only achieved if both features are used together. Therefore high correlation does not imply redundancy.
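A numeric sketch of this effect (my own synthetic construction in the spirit of the textbook figure, not the book's exact data): two features share a common component, making them highly correlated, while the classes are offset perpendicular to the correlation direction. Either feature alone mixes the classes, but their difference separates them almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
noise = 0.1

# Shared component t stretches both classes along the diagonal x2 = x1;
# the classes are offset perpendicular to that direction by +/- 0.5.
t_a = rng.normal(size=n)
x1_a = t_a + noise * rng.normal(size=n)
x2_a = t_a - 0.5 + noise * rng.normal(size=n)   # class A: below the diagonal

t_b = rng.normal(size=n)
x1_b = t_b + noise * rng.normal(size=n)
x2_b = t_b + 0.5 + noise * rng.normal(size=n)   # class B: above the diagonal

x1 = np.concatenate([x1_a, x1_b])
x2 = np.concatenate([x2_a, x2_b])
y = np.concatenate([np.zeros(n), np.ones(n)])

r = np.corrcoef(x1, x2)[0, 1]
print(r)  # high correlation between the two features

# Using both features (their difference) separates the classes cleanly,
# even though each feature alone shows heavily overlapping class distributions.
acc = np.mean(((x2 - x1) > 0) == (y == 1))
print(acc)  # accuracy near 1.0
```

Plotting `x1` vs `x2` colored by `y` with `matplotlib.pyplot.scatter` reproduces the two parallel clouds from the figure.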