Solved – Feature Selection Using Principal Feature Analysis and Variables Factor Map

feature-selection, pca, python, r

I am trying to select the most important features, the ones that explain the variability of my data, using an unsupervised approach in Python (though I would also consider R).

This is after performing a PCA, looking at PC1 and PC2, and examining the Variables Factor Map.

The Variables Factor Map shows the contribution of each variable to the components, in this case PC1 and PC2, which carry most of the variation in the data.
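For reference, this is roughly how I read those contributions off the PCA loadings (a minimal scikit-learn sketch; the random `X` is just a placeholder for my data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder for the real data matrix (n_samples, n_features).
X = np.random.default_rng(0).normal(size=(200, 20))

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Each row of components_ is a unit-norm eigenvector, so its squared
# entries sum to 1; scaled by 100 they give the percent contribution of
# each variable to that axis, which is what the factor map encodes.
contrib = 100 * pca.components_ ** 2  # shape (2, n_features)

for k in range(2):
    top = np.argsort(contrib[k])[::-1][:5]
    print(f"PC{k + 1} top contributors: {top}, {contrib[k, top].round(1)}%")
```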

If I apply Principal Feature Analysis 10,000 times and rank the features by how many times each one is selected among the most important (top 10) features, the features that dominate PC1 and PC2 end up in the bottom half of the ranking, i.e. they appear to be not very important, and vice versa.

If I perform the PCA using only the top half of the features, the result looks very similar to the original, but if I use the bottom half it looks different; in other words, the feature selection seems to be doing its job. However, I do not understand why the features that contribute the most to PC1 and PC2 appear to be the least important. This seems unintuitive and makes me think I am missing something.
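For concreteness, here is a minimal sketch of the kind of experiment I mean, in scikit-learn. The placeholder data, the parameter choices, and the exact PFA variant (k-means on the features' loading vectors, keeping the feature nearest each centroid) are illustrative assumptions:

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pfa(X, n_keep=10, n_components=5, seed=None):
    """One run of Principal Feature Analysis: cluster the features'
    PCA loading vectors with k-means and keep, from each cluster,
    the feature nearest the centroid."""
    pca = PCA(n_components=n_components).fit(X)
    vecs = pca.components_.T  # one loading vector per feature
    km = KMeans(n_clusters=n_keep, n_init=1, random_state=seed).fit(vecs)
    return [int(np.argmin(np.linalg.norm(vecs - c, axis=1)))
            for c in km.cluster_centers_]

# Placeholder for the real data matrix.
X = StandardScaler().fit_transform(
    np.random.default_rng(0).normal(size=(300, 30)))

# Repeat the selection many times and tally how often each feature
# lands in the selected (top 10) set.
counts = Counter()
for seed in range(1_000):  # 10,000 in the question; fewer here for speed
    counts.update(pfa(X, n_keep=10, seed=seed))

ranked = [f for f, _ in counts.most_common()]  # most-selected first
halves = {"top half": ranked[:len(ranked) // 2],
          "bottom half": ranked[len(ranked) // 2:]}

# Re-run PCA on each half and compare how much variance PC1/PC2 capture.
for name, idx in halves.items():
    sub = PCA(n_components=2).fit(X[:, idx])
    print(name, sub.explained_variance_ratio_.round(3))
```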

Could anyone help me understand this issue?

Thanks.

Best Answer

I have tried a similar approach and am pretty sure the k-means clustering step is causing the issues. The selected features depend heavily on the random state, and I'm quite sure that is the problem (at least in my case). It sounds like your feature selection is also somewhat random, which could explain these odd results. Hope that helps!
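One quick way to probe this (a sketch, assuming scikit-learn and a PFA setup like the one in the question; `X` is again placeholder data): compute the loading vectors once, then run the k-means step across many seeds with a single initialization versus many restarts, and count how many distinct feature sets come out. A larger `n_init` should make each run much less sensitive to its initialization.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# The loading vectors are computed once, so any variation below comes
# purely from the k-means initialization.
X = StandardScaler().fit_transform(
    np.random.default_rng(1).normal(size=(300, 30)))
vecs = PCA(n_components=5).fit(X).components_.T

for n_init in (1, 50):
    picks = set()
    for seed in range(20):
        km = KMeans(n_clusters=10, n_init=n_init, random_state=seed).fit(vecs)
        picks.add(frozenset(
            int(np.argmin(np.linalg.norm(vecs - c, axis=1)))
            for c in km.cluster_centers_))
    print(f"n_init={n_init}: {len(picks)} distinct selections over 20 seeds")
```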
