Solved – correlation of features and target in predicting red wine quality in machine learning

correlation, feature selection, machine learning

In the red wine dataset there are 11 features, such as acidity and pH, and one target (quality). I am using those features to predict the quality score of the red wine, and I want to choose the most important features to compose my design matrix. So I calculated the correlation between every feature and the target, and chose the 6 features that have the strongest relationship with the target. Is it OK to ignore the relationships between the features themselves? For example, acidity and pH might be strongly correlated, but I only considered the correlation between each feature and the target, not the correlations among the features. If it is not OK, how should I handle this? By using PCA?

Best Answer

The short answer is that it's okay to use correlation in that way, but I'll elaborate a bit further.

What you've done is a type of feature selection. More precisely, it's a filter method, which means we select a subset of the features based on some metric. Using the correlation between a feature and the target is common practice because it's simple and fast to run. However, as you suggest in your question, also calculating the correlation between every feature pair could improve your results by removing potential redundancies (a feature which is highly correlated with another won't add much extra information to the system). That said, adding feature-feature correlations to the equation makes the problem more involved. See the following for an interesting method: http://www.ime.unicamp.br/~wanderson/Artigos/correlation_based_feature_selection.pdf
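As a rough illustration, here is a minimal sketch of that two-step filter (rank by correlation with the target, then drop redundant features), assuming the UCI red wine data is available locally as a semicolon-separated `winequality-red.csv` with a `quality` column; the 0.8 redundancy threshold is an arbitrary choice:

```python
import pandas as pd

# Assumed local copy of the UCI red wine dataset (semicolon-separated,
# with a "quality" column); adjust the path for your setup.
df = pd.read_csv("winequality-red.csv", sep=";")

X = df.drop(columns="quality")
y = df["quality"]

# Filter step: rank features by absolute Pearson correlation with the
# target and keep the 6 strongest, as described in the question.
target_corr = X.corrwith(y).abs().sort_values(ascending=False)
selected = list(target_corr.index[:6])

# Redundancy step: drop a selected feature if it is highly correlated
# with an already-kept feature (0.8 is an arbitrary threshold).
kept = []
for feat in selected:
    if all(abs(X[feat].corr(X[other])) < 0.8 for other in kept):
        kept.append(feat)

print("Top 6 by target correlation:", selected)
print("After removing redundant features:", kept)
```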

That said, correlation itself is a limited metric: Pearson correlation can only capture linear relationships, and the relationships between features and target in machine learning problems are often non-linear. So, if you want a more sophisticated feature selection, I would suggest another metric, such as mutual information. Sklearn has a range of built-in methods you can choose from: http://scikit-learn.org/stable/modules/feature_selection.html
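For example, here is a minimal sketch using scikit-learn's `SelectKBest` with `mutual_info_regression`, continuing with the `X` and `y` from the sketch above (`k=6` simply mirrors the question's choice of 6 features):

```python
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Mutual information can capture non-linear dependence that Pearson
# correlation misses; k=6 mirrors the question's choice.
selector = SelectKBest(score_func=mutual_info_regression, k=6)
X_selected = selector.fit_transform(X, y)

print("Selected features:", list(X.columns[selector.get_support()]))
```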

Now, PCA isn't a feature selection method per se. It tries to represent the feature set with an artificial set of lower dimension while retaining most of the information content of the original data. In other words, both feature selection and PCA can produce a smaller feature set, but the former does so by removing unnecessary information, whereas the latter produces a new representation of the data. And, of course, you can use both methods together if you like.
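For comparison, a minimal PCA sketch on the same `X` (standardizing first, since PCA is scale-sensitive; `n_components=6` is again an arbitrary choice to match the selection above):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=6)
X_pca = pca.fit_transform(X_scaled)

# Each component is a mixture of all original features, so this is a
# new representation of the data, not a subset of the original columns.
print("Explained variance ratio:", pca.explained_variance_ratio_)
```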
