Feature Selection – Techniques for Uncorrelated Datasets

classification, correlation, feature-selection, feature-engineering

I am working on a speech emotion recognition problem and my training dataset consists of about $4000$ points of $138$ features each. The highest (Pearson) correlation among the features is $0.3$, and there are only $7$ features whose correlation with the target falls in the range $(0.3, 0.4)$.

Does it make sense to investigate feature selection techniques in this case? My understanding is that it does not, since the correlations among the features and between the features and the target are quite low. However, I would appreciate your thoughts on this because I do not have much experience in this field. Thank you.

Best Answer

Since you are working on a speech emotion recognition problem, I assume the data are quite complex and that you are not using simple linear methods like linear regression. Please correct me if I'm wrong in this assumption.

General Notes about your problem:

  1. Don't forget that Pearson correlation only captures linear relationships between variables. There might be non-linear (polynomial, logarithmic, etc.) relationships between your variables, or step-function-like relationships. All of these would be poorly captured by Pearson correlation (see the short sketch after this list).
  2. Since your Pearson correlations are low, the relationships in the dataset (if any) may well be non-linear and complex, especially given the subject matter. If there are complex or non-obvious relationships to be discovered, then 4,000 data points may not be enough. It's a decent amount, and it might suffice depending on your model, but it certainly isn't a huge amount of data, especially given how many features you have. Think about it this way: your model has to identify relationships between all 138 features and the target variable, and it's only given 4,000 data points to do so. It might not be able to capture everything there is to capture, so it can make sense to whittle down your feature set.
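To make the first note concrete, here is a tiny synthetic illustration (not your data, and the numbers are arbitrary) of how a strong but non-linear relationship can produce a near-zero Pearson correlation:

```python
# Hypothetical illustration: a strong but non-monotonic relationship
# that Pearson correlation almost completely misses.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=4000)             # one synthetic feature
y = x ** 2 + rng.normal(0, 0.05, size=4000)   # target depends strongly on x, but quadratically

r, _ = pearsonr(x, y)
print(f"Pearson r = {r:.3f}")  # close to 0, despite the strong relationship
```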

That's the perfect set-up to answer your question directly:

Yes, it absolutely makes sense to investigate feature selection techniques.

Reasons:

  1. Just because the Pearson correlation is low doesn't necessarily mean there's no relationship. Feature selection methods can help you quickly figure out whether a more complex relationship is there to be discovered.
  2. For the reason mentioned in the second note above: if some of your 138 variables are unhelpful, then depending on your choice of model it can be very helpful to get rid of them, so your model can focus on the relationships between the genuinely useful variables and the target.

Pointers to get started on feature selection:

Again, it depends on your model, but broadly speaking, I would heavily recommend some version of Permutation Feature Importance to figure out which features are helpful. Read more here: https://scikit-learn.org/stable/modules/permutation_importance.html

There are various packages that implement it, like sklearn in Python and Boruta in R.
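As a rough sketch of what this can look like with scikit-learn (the synthetic data, random-forest model, and hyperparameters below are just placeholders for your own setup):

```python
# A minimal permutation-importance sketch with scikit-learn.
# X (4000 x 138) and y stand in for your own feature matrix and labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=138, n_informative=10,
                           random_state=0)  # placeholder data
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the drop in score.
result = permutation_importance(model, X_val, y_val, n_repeats=10,
                                random_state=0, n_jobs=-1)

# Features whose shuffling barely hurts the score are candidates for removal.
ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking[:10]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

Importantly, the importance is computed on a held-out set, so it reflects which features the fitted model actually relies on for generalisation, not just for fitting the training data.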

Quick tip for Permutation Feature Importance: to make it faster and more coherent to run, try clustered Permutation Feature Importance (https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-multicollinear-py). Essentially, group your 138 features into several groups (by which variables are most similar), and then run permutation feature importance on each group as a whole, not on individual variables.
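Here is one rough way the clustered idea could be sketched by hand; the function name, the cluster count, and the distance choice are illustrative assumptions (this is not a scikit-learn API), and it assumes numpy arrays plus a fitted model like the one in the previous sketch:

```python
# Group similar features with hierarchical clustering, then permute each
# whole group at once and measure the drop in validation score.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def grouped_permutation_importance(model, X_val, y_val, n_clusters=20,
                                   n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)

    # Cluster features by rank correlation; distance = 1 - |rho|.
    corr, _ = spearmanr(X_val)
    dist = 1 - np.abs(corr)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")

    base = model.score(X_val, y_val)
    importances = {}
    for cluster in np.unique(labels):
        cols = np.where(labels == cluster)[0]
        drops = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            perm = rng.permutation(len(X_val))
            X_perm[:, cols] = X_val[perm][:, cols]  # shuffle the whole group together
            drops.append(base - model.score(X_perm, y_val))
        importances[tuple(cols)] = np.mean(drops)
    return importances

# e.g. grouped_permutation_importance(model, X_val, y_val)
# using the model and validation split from the previous sketch.
```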

If that's a bit too advanced, simpler feature selection methods include forward stepwise selection (add variables one at a time), backward stepwise selection (remove variables one at a time), and LASSO regression (a type of regression that simultaneously fits a model and removes obviously bad variables; 138 features might be too many to feed straight into it, though). These are all relatively straightforward to implement, and a browse through Google should give you good intuition and code for each of them.
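For instance, with scikit-learn (again just a hedged sketch: the number of features to select and the regularisation strength below are arbitrary placeholders, and since your task is classification the LASSO idea appears here as L1-penalised logistic regression):

```python
# Minimal sketches of the simpler methods mentioned above.
# X and y stand in for your own feature matrix and labels.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=4000, n_features=138, n_informative=10,
                           random_state=0)  # placeholder data

# Forward stepwise selection: greedily add features one at a time.
forward = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                    n_features_to_select=20, direction="forward",
                                    cv=3, n_jobs=-1).fit(X, y)
print("forward-selected features:", forward.get_support(indices=True))

# L1-penalised ("LASSO-style") logistic regression: coefficients driven to
# exactly zero mark features the linear model considers uninformative.
lasso = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear",
                                           C=0.1)).fit(X, y)
print("L1-selected features:", lasso.get_support(indices=True))
```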

Side note: there are certain algorithms (like random forests) that often do not benefit greatly from feature selection. So, if you are familiar with such a technique and don't want to fuss with feature selection, that could be an option as well.

I hope this gives you a good sense of why feature selection might be helpful even when the correlation between the variables and the target is low, as well as some guidance on how to get started running feature selection on your data.
