Solved – Feature selection clustering customer segmentation

data mining, feature selection, r, unsupervised learning

Based on customer data, I want to perform clustering using different clustering algorithms (K-Means, Expectation Maximization, etc.) in R. Most of the attributes were engineered with the goal of being meaningful for a customer segmentation.

Without feature selection, the results are very poor according to evaluation criteria such as average silhouette width (ASW), between-cluster sum of squares (BSS), and within-cluster sum of squares (WSS).
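For readers unfamiliar with these criteria, here is a minimal sketch of how WSS, BSS, and ASW can be computed from a hard cluster assignment. It is written in plain Python rather than R, uses Euclidean distance, and the function names and toy data are hypothetical, chosen only for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def group_by_label(points, labels):
    """Map each cluster label to the list of points assigned to it."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def wss_bss(points, labels):
    """Return (WSS, BSS) for a hard clustering, using squared Euclidean distance."""
    overall = centroid(points)
    wss = bss = 0.0
    for members in group_by_label(points, labels).values():
        center = centroid(members)
        wss += sum(dist(p, center) ** 2 for p in members)
        bss += len(members) * dist(center, overall) ** 2
    return wss, bss


def average_silhouette_width(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    clusters = group_by_label(points, labels)
    if len(clusters) < 2:
        raise ValueError("silhouette is undefined for fewer than two clusters")
    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical toy data: two well-separated groups of two customers each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]
wss, bss = wss_bss(points, labels)
asw = average_silhouette_width(points, labels)
print(f"WSS={wss:.2f}  BSS={bss:.2f}  ASW={asw:.3f}")
```

A low WSS relative to BSS and an ASW close to 1 indicate compact, well-separated clusters. Because all three depend on the distance scale, they are only comparable across feature sets when the features are pre-processed the same way.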

My question now is whether I need to apply a feature selection technique (wrapper/filter) or just select the features I think are most valuable for segmenting the customers. I found very different views on this issue. Most authors say the features have to be selected with the business objective in mind. Other sources propose feature selection methods for unsupervised learning. Is that really useful for customer segmentation, or is it only needed for tasks such as image segmentation?

My opinion is: attributes might be economically valuable even though they are not useful for the clustering process, and vice versa. This would mean I select the features manually.

I have already performed a PCA, which also gave poor results on the clustering evaluation criteria. Therefore I obviously have to select only a few attributes in order to obtain a clear and stable clustering.

Best Answer

I suggest reviewing the literature and work by the mixmod group and by Raftery's group. Both have methods for model-based clustering, with and without feature selection. Heuristic-based methods may be appropriate for your case, but the performance of both the heuristics and the model-based methods tends to be highly influenced by your data inputs and your data pre-processing (see below).

Typically in a business case, you have variables from many different distributions. This poses problems in mixture modeling, and you have not specified (a) whether this is the case in your data and (b) if so, how you wish to deal with it. Another concern is how knowledgeable you are about your data. How confident are you that you can actually select the most important features?

Questions

  • What types of variables do you have? What are their distributions?
  • What is their correlation structure (you mentioned poor results from PCA, without detail)?
  • How are you pre-processing your variables?
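
As a starting point for answering these questions, a minimal sketch such as the following can standardize numeric attributes and flag strongly correlated pairs. It is plain Python rather than R; the column names, the toy values, and the 0.9 threshold are assumptions made up for illustration, not a recommendation for your data.

```python
from statistics import mean, pstdev


def standardize(column):
    """Z-score a numeric column using the population standard deviation."""
    center, spread = mean(column), pstdev(column)
    if spread == 0:
        raise ValueError("constant column cannot be standardized")
    return [(value - center) / spread for value in column]


def pearson(x, y):
    """Pearson correlation, computed as the mean product of z-scores."""
    return sum(a * b for a, b in zip(standardize(x), standardize(y))) / len(x)


def correlated_pairs(table, threshold=0.9):
    """List (name_a, name_b, r) for every column pair with |r| >= threshold."""
    names = list(table)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(table[first], table[second])
            if abs(r) >= threshold:
                pairs.append((first, second, r))
    return pairs


# Hypothetical customer attributes; "revenue" is an exact multiple of "orders".
customers = {
    "orders": [1, 2, 3, 4, 5],
    "revenue": [10, 20, 30, 40, 50],
    "age": [30, 25, 41, 38, 29],
}
pairs = correlated_pairs(customers)
for first, second, r in pairs:
    print(f"{first} ~ {second}: r = {r:.3f}")
```

Near-duplicate attributes like this pair effectively count the same information twice in a distance-based clustering, which is one reason the correlation structure and pre-processing matter before any feature selection is attempted.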

If you provide additional detail on your data, a more complete answer can be provided.
