Solved – How should I deal with highly correlated features

correlation, feature-selection, feature-engineering, machine-learning, pca

I'm doing multi-class classification on the Abalone dataset, dividing the abalones into the age groups young, adult and old.

While doing so, I found that the columns for abalone size and weight are highly correlated. I'm also using the sex category via one-hot encoding.
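One way to see this is to compute the pairwise correlation matrix of the numeric columns. The sketch below is illustrative, not your exact setup: the column names follow the UCI Abalone description, the loading line is commented out, and a tiny synthetic stand-in is used so the snippet runs on its own.

```python
import pandas as pd

# The UCI Abalone dataset uses these columns; adjust the path and
# names to match your own copy (hypothetical loading step):
columns = ["Sex", "Length", "Diameter", "Height", "Whole weight",
           "Shucked weight", "Viscera weight", "Shell weight", "Rings"]
# df = pd.read_csv("abalone.data", names=columns)

# Small synthetic stand-in so the sketch is self-contained:
df = pd.DataFrame({
    "Length":       [0.455, 0.350, 0.530, 0.440, 0.330],
    "Diameter":     [0.365, 0.265, 0.420, 0.365, 0.255],
    "Whole weight": [0.514, 0.226, 0.677, 0.516, 0.205],
})

# Pearson correlation between the numeric size/weight columns.
corr = df.corr(numeric_only=True)
print(corr.round(2))

# Flag pairs above a chosen threshold (0.9 here, purely illustrative).
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.9]
print(high)
```

On the real dataset the size and weight columns typically show correlations well above 0.9, which is exactly the situation you describe.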


The dataset info also mentioned that "Data set samples are highly overlapped. Further information is required to separate completely using affine combinations."

What are the implications of these high correlations, and how should I approach feature engineering knowing this?

Best Answer

In theory, this should not hurt your ability to make predictions - after all, the only truly useless feature would be a restated column, or one whose values can be derived directly from another (e.g. having both radius and circumference as columns). Just because your features are correlated does not mean they are not useful; in fact, that correlation could itself be valuable if your dataset is representative of what is out there "in the wild".

However, if your dataset is limited, then you may run into trouble, as highly correlated columns each add very little extra information about the subject. PCA is a great candidate for collapsing them, as mentioned in the comments. Random Forests are also promising, since their feature importances can tell you which columns contribute most to classifying your data. Gradient-boosting classifiers can also help with data that resists more elementary methods.
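The two suggestions above can be sketched together. This is a minimal illustration on synthetic data, not the Abalone dataset itself: three artificially correlated "size" columns plus one noise column, a toy binary label, and default scikit-learn settings throughout.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the abalone features: three highly correlated
# "size" columns plus one independent noise column.
n = 500
size = rng.normal(0.5, 0.1, n)
X = np.column_stack([
    size,                                  # length
    size * 0.8 + rng.normal(0, 0.01, n),   # diameter, tracks length
    size ** 3 + rng.normal(0, 0.01, n),    # weight, grows with length
    rng.normal(0, 1, n),                   # unrelated noise column
])
y = (size > 0.5).astype(int)  # toy two-class "age group" label

# PCA after standardising: the correlated columns collapse onto a
# single dominant component, showing how little independent
# information they carry.
pca = make_pipeline(StandardScaler(), PCA()).fit(X)
explained = pca.named_steps["pca"].explained_variance_ratio_
print(explained.round(3))

# Random forest importances indicate which raw columns matter most;
# the noise column should score near zero.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_.round(3))
```

The same pattern applied to your real features would show whether the size/weight block can be replaced by one or two principal components without losing much variance.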

At any rate, I'm curious to hear what your baseline is with basic classifiers!
