Solved – Remove features with high correlation

classification, correlation, feature-selection, machine-learning, svm

In a classification problem using a linear SVM, I am trying to remove variables from the dataset that are strongly (Pearson-)correlated with each other.

  • What is the usual recommended threshold? I currently delete variables when their correlation is >= 1.0 or <= -1.0, but I wonder whether I should use 0.5 instead.
  • Should I compute my correlation matrix before or after scaling the data?
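As a sketch of the filtering step being asked about, here is one common way to drop features whose absolute Pearson correlation with an earlier feature exceeds a threshold (the variable names, synthetic data, and the 0.9 threshold are illustrative assumptions, not a recommendation):

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop each column whose |Pearson correlation| with an
    earlier column meets the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] >= threshold).any()]
    return df.drop(columns=to_drop)

# Synthetic example: x_noisy is almost a copy of x, z is independent
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_noisy": x + 0.01 * rng.normal(size=200),
    "z": rng.normal(size=200),
})
reduced = drop_correlated(df, threshold=0.9)
print(list(reduced.columns))  # x_noisy is dropped; x and z remain
```

On the second bullet: Pearson correlation is invariant to linear rescaling of each variable, so standardizing the data first leaves the correlation matrix unchanged.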

Best Answer

I don't think this can be answered in the abstract; we would need to know what the goal is and what the variables are. Some possibilities:

Consider two variables that are basically different measurements of the same thing, with independent errors. Then keeping both (or using their mean) looks reasonable.

Other cases are less clear-cut. Since you have a classification problem, the following is relevant: Feature Selection: Correlation and Redundancy. Even highly correlated variables can carry non-redundant information, and in the example given there, removing either of them would destroy its information content. But, interestingly, in that example replacing the two variables with their difference would work! That underscores that each case has to be treated on its own; a single threshold value that always works would be hard to find.
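The point about correlated-but-non-redundant features can be illustrated with a small simulation (this is my own synthetic construction, not the example from the linked thread): two features share a large common component, so they are almost perfectly correlated, yet the class signal lives entirely in their difference.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
t = rng.normal(scale=10, size=n)       # large shared component -> high correlation
y = rng.integers(0, 2, size=n)         # binary class labels
delta = np.where(y == 1, 1.0, -1.0)    # class information, small in magnitude
x1 = t + 0.5 * delta + 0.1 * rng.normal(size=n)
x2 = t - 0.5 * delta + 0.1 * rng.normal(size=n)

# The two features look redundant by Pearson correlation...
corr12 = np.corrcoef(x1, x2)[0, 1]
print(corr12)  # close to 1

# ...yet their difference separates the classes almost perfectly,
# so dropping either feature would destroy the signal.
acc = np.mean(((x1 - x2) > 0) == (y == 1))
print(acc)
```

A correlation-threshold filter at any reasonable cutoff would discard one of these features and with it the class information, which is exactly the caveat the answer raises.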
