Solved – How to determine which variables to be used for cluster analysis

clustering, feature selection, hierarchical clustering

I have about 10 variables (features) and want to run a cluster analysis of cases (data points). I have a few ideas about which variables to include:

  1. Plot the variables pairwise in scatter plots and see whether some of the variables show rough groups;

  2. Do factor analysis or PCA and combine the variables that are similar (correlated);

  3. Use all of the variables in clustering, then use ANOVA (or a similar group-comparison technique) to test whether the clusters differ on each variable. Delete the variables that show no significant difference among clusters, run the clustering again, and test again.

Are there better ways to decide which variables to use for cluster analysis?

Thank you for your input.

Best Answer

https://www.researchgate.net/profile/Federico_Marini/publication/230276990_Finding_relevant_clustering_directions_in_highdimensional_data_using_Particle_Swarm_Optimization/links/550c0b570cf20637993960f2.pdf

This paper describes how to find relevant clustering directions using particle swarm optimization. The algorithm uses binary PSO (BPSO). A MATLAB implementation of BPSO is available on MATLAB Central. You can modify its cost function to follow the one defined in the paper.
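For readers who want to see the mechanics, here is a rough Python sketch of a common binary PSO variant (sigmoid-mapped velocities with an inertia weight). It is not the paper's code or the MATLAB Central implementation. The function name `binary_pso`, the parameter defaults, and especially `toy_cost` are assumptions made for illustration. `toy_cost` is only a placeholder that pretends variables 0, 3 and 7 are the informative ones, and you would swap in a clustering criterion evaluated on the selected variables.

```python
import math
import random


def binary_pso(cost, n_bits, n_particles=20, n_iter=50,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Minimise cost(mask) over 0/1 masks of length n_bits."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best mask seen per particle
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]  # best mask seen overall

    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(n_bits):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][j]
                     + c1 * r1 * (pbest[i][j] - pos[i][j])
                     + c2 * r2 * (gbest[j] - pos[i][j]))
                vel[i][j] = max(-v_max, min(v_max, v))  # clamp velocity
                # The sigmoid of the velocity is the probability of a 1 bit.
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))
                pos[i][j] = 1 if rng.random() < prob else 0
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost


def toy_cost(mask):
    """Placeholder cost: number of bits that differ from a made-up target."""
    target = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
    return sum(m != t for m, t in zip(mask, target))


best_mask, best_cost = binary_pso(toy_cost, n_bits=10)
print(best_mask, best_cost)
```

Each bit of the mask says whether the corresponding variable is used, so with your 10 variables `n_bits` would be 10 and the cost would run the clustering on the masked columns and return a quality score to minimise.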