Solved – How to spot and remove unimportant columns in a dataframe

feature-selection, feature-engineering, machine-learning, pandas, python

I have a pandas dataframe with several columns; for example, let's say:

   A   B   C   D(labels)
   45  88  44  0
   62  34   2  1
   85  65  11  1
   74  43  42  1
   90  38  34  0
        ...
    0  94  45  1
   58  23  23  0

How can I detect which columns are useless and which are useful to a machine learning model? I already tried several methods such as PCA, removing features with low variance, and univariate feature selection with a chi-squared criterion, but none of them seems to work: the performance of my classifier is still low. I also tried to create more features (adding more columns to the feature matrix), and they decreased the performance of my classifier. Is there any way to spot which columns are useless?

Best Answer

I'm assuming that training/CV/test performance are all bad, and therefore that the problem is not overfitting. In a nutshell, you could then try the following to meaningfully reduce your features:

  • Use feature correlation to drop highly correlated (redundant) features,
  • Feature selection techniques such as feature filters and feature wrappers,
  • Feature reduction using techniques like PCA, or
  • Models that internally "weight" features themselves (a sketch of the first and last options follows this list).
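
As a concrete starting point, here is a minimal sketch of the first and last bullets, using a toy dataframe shaped like the one in the question; the 0.9 correlation threshold and the random forest are illustrative choices, not recommendations:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Toy data shaped like the question's example; replace with your own df.
    df = pd.DataFrame({
        "A": [45, 62, 85, 74, 90, 0, 58],
        "B": [88, 34, 65, 43, 38, 94, 23],
        "C": [44, 2, 11, 42, 34, 45, 23],
        "D": [0, 1, 1, 1, 0, 1, 0],  # labels
    })
    X, y = df.drop(columns="D"), df["D"]

    # 1) Correlation filter: for each pair of features with |corr| above a
    #    threshold (0.9 here is an arbitrary choice), drop one of the two.
    corr = X.corr().abs()
    to_drop = [col for i, col in enumerate(corr.columns)
               if (corr.iloc[:i, i] > 0.9).any()]
    X_reduced = X.drop(columns=to_drop)

    # 2) Model-based weighting: impurity-based feature importances from a
    #    random forest give a rough usefulness ranking of the columns.
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_reduced, y)
    print(pd.Series(rf.feature_importances_, index=X_reduced.columns)
            .sort_values(ascending=False))

Note that impurity-based importances can be biased toward high-cardinality features; permutation importance is a common cross-check.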

Things you should consider:

  • As @KeithHughitt mentioned, the problem might be that the relation you seek is simply not present in your data. In such a case it might be impossible for models to perform and generalize well. There is no "one perfect" solution for those cases, but, as you already mentioned, deriving features (the same information, differently processed) and/or adding information (new information, e.g. by recording more features) might help; the first sketch after this list shows what feature derivation could look like.

  • Another explanation for bad predictive performance with big data/many features might be that the feature-target relation is too complex to be represented adequately by your model (e.g. trying to model circular data with a linear model; see the second sketch after this list). In such cases, another option besides adding preprocessing/feature derivation would be to employ more complex models, but these usually come at the cost of increased computational demands, as with deep learning.
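
To make the feature-derivation idea from the first bullet concrete, a sketch like the following could apply; the specific combinations are hypothetical, and which (if any) help depends entirely on your data:

    import numpy as np
    import pandas as pd

    # Hypothetical derived features: the same raw information,
    # differently processed.
    df = pd.DataFrame({"A": [45, 62, 85], "B": [88, 34, 65], "C": [44, 2, 11]})

    df["A_plus_B"] = df["A"] + df["B"]            # additive interaction
    df["A_times_C"] = df["A"] * df["C"]           # multiplicative interaction
    df["B_over_C"] = df["B"] / (df["C"] + 1e-9)   # ratio, guarded against /0
    df["log_A"] = np.log1p(df["A"])               # nonlinear rescaling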
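
And to illustrate the second bullet, here is a small sketch using scikit-learn's make_circles; the accuracies in the comments are what one would typically expect, not guaranteed values:

    from sklearn.datasets import make_circles
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Circular two-class data: not separable by any straight line.
    X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

    # A linear model cannot represent the circular decision boundary ...
    linear_acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

    # ... while a more complex model (an RBF-kernel SVM) can.
    rbf_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()

    print(f"linear model:   {linear_acc:.2f}")  # typically near chance (~0.5)
    print(f"RBF-kernel SVM: {rbf_acc:.2f}")     # typically close to 1.0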
