Solved – technical issues regarding cluster analysis

categorical data, clustering, dimensionality reduction

Hi, I would like to seek help with my cluster analysis using SAS.

The main objective of the task is to segment customers into groups based on their similarity. The dataset contains mixed types of variables, including continuous (age, income, spending, etc.), ordinal (education, etc.) and nominal (gender, occupation, etc.).

  1. I simply use the VARCLUS procedure in SAS to select the variables to be used in developing clusters. Since it only applies to continuous variables, I assign numbers to ordinal variables and convert nominal variables to binary before running VARCLUS. What are alternative ways to deal with categorical variables?

  2. Is it better to apply principal component analysis or factor analysis for variable reduction in this case? If so, which of the two is the better option?

  3. What are possible ways to put more weight on certain attributes when developing clusters?

  4. With different variables, different distance measures and different clustering algorithms, the clusters can vary a great deal. Currently, I simply compare my cluster results from a business-use perspective and choose the one that looks most meaningful. Are there methods that help in comparing and validating clustering results?

Best Answer

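  • First, assess whether your continuous data needs to be normalized. Practice has shown that normalizing numeric inputs makes model fitting more efficient and tends to give better results; for clustering in particular, it stops variables measured on large scales (income, say) from dominating the distance calculation over variables on small scales (age). You can use any of the methods below, depending on your model assumptions.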

    • Z-score (Gaussian) standardization, i.e., v' = (v - mean) / std dev
    • Min-max scaling, i.e., v' = (v - min) / (max - min)
    • Box-Cox power transformation
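
As a rough illustration of the first two options, here is a minimal Python sketch. The function names and the income figures are made up for the example; in SAS itself the equivalent step would normally be done with a standardization procedure such as PROC STDIZE rather than hand-written code.

```python
import statistics


def z_score(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    # Population standard deviation; swap in statistics.stdev for the sample version.
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant variable: nothing to scale
    return [(v - mean) / std for v in values]


def min_max(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant variable: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical incomes, for illustration only.
incomes = [20000, 35000, 50000, 65000, 80000]
income_z = z_score(incomes)    # centred on 0, unit spread
income_01 = min_max(incomes)   # smallest value maps to 0, largest to 1
```

Z-scores keep outliers far from the bulk of the data, while min-max scaling squeezes everything into a fixed range, so which one suits you depends on how much influence you want extreme customers to have.
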
  • You are right that dummy coding your categorical variables is required for PROC VARCLUS, because the procedure groups variables using statistics such as R-squared and the Pearson correlation, and those statistics can only be applied to numeric variables. If discrete data is not handled carefully, there is a high chance that the clustering algorithm ends up discovering the discreteness of your data instead of a sensible structure. Where possible, consider rank ordering the categories on the basis of some business justification; for example, occupations can be ranked by their corresponding average salary.

  • If you want to specify relative weights for each observation in the input data set, place the weights in a variable in the data set and name that variable in a WEIGHT statement, i.e., WEIGHT variable;. Note that this weights observations (rows), not attributes.

    However, for points 2 and 3 of your question, a better approach might be to rank the variables by predictive power using their Information Value (IV) and Weight of Evidence (WOE). You can find the SAS macro and paper here. One advantage of this program is that continuous, ordinal and categorical variables can be assessed together.

  • Here are a few links describing cluster validation techniques
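
To make the idea of cluster validation concrete, one commonly used internal measure is the silhouette coefficient, which compares how close each point is to its own cluster with how close it is to the nearest other cluster. The sketch below is only an illustration in plain Python, not something taken from the answer above; the points and the two labelings are invented, and Euclidean distance is an assumption that you would replace with whatever distance your clustering used.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels, dist=euclidean):
    """Mean silhouette width over all points; values lie between -1 and 1."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical, already standardized 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # labels follow the groups
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # labels cut across them
```

Computing the same score for each candidate solution (different variables, distances or algorithms) gives a number to set beside the business judgement; a higher average silhouette suggests tighter, better separated clusters, though it should not replace checking that the segments make sense.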

If your variables are largely categorical, you should perhaps consider hierarchical clustering with an appropriate distance function.
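
As one possible reading of "appropriate distance function", the sketch below computes a weighted Gower-style dissimilarity, which handles numeric and nominal variables together and accepts a weight per variable. This is my own illustration rather than the method the answer had in mind; the customer records, the variable kinds and the weights are all hypothetical, and ordinal variables are assumed to be rank coded and treated as numeric.

```python
def gower_distance(a, b, kinds, ranges, weights):
    """Weighted Gower-style dissimilarity between two records, in [0, 1].

    kinds[k]   -- "num" (continuous or rank-coded ordinal) or "cat" (nominal)
    ranges[k]  -- max minus min of variable k over the data (unused for "cat")
    weights[k] -- non-negative importance given to variable k
    """
    total = 0.0
    for x, y, kind, rng, w in zip(a, b, kinds, ranges, weights):
        if kind == "num":
            # Range-scaled absolute difference, so every variable lies in [0, 1].
            d = abs(x - y) / rng if rng > 0 else 0.0
        else:
            # Simple matching for nominal values: 0 if equal, 1 otherwise.
            d = 0.0 if x == y else 1.0
        total += w * d
    return total / sum(weights)


# Hypothetical customers: (age, income, education rank, gender, occupation).
customers = [
    (25, 30000, 1, "F", "clerk"),
    (27, 32000, 1, "F", "clerk"),
    (55, 90000, 3, "M", "manager"),
]
kinds = ["num", "num", "num", "cat", "cat"]
ranges = [
    max(c[k] for c in customers) - min(c[k] for c in customers)
    if kinds[k] == "num" else 0
    for k in range(len(kinds))
]
weights = [1.0, 2.0, 1.0, 1.0, 1.0]  # assumption: income counts double

n = len(customers)
dist_matrix = [
    [gower_distance(customers[i], customers[j], kinds, ranges, weights)
     for j in range(n)]
    for i in range(n)
]
```

The resulting matrix can be handed to any hierarchical clustering routine that accepts precomputed distances, and raising a variable's weight is one simple way to make that attribute count for more. SAS users may want to check the PROC DISTANCE documentation for a built-in Gower option before coding this by hand.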