Solved – Preprocessing survey data for clustering

Tags: clustering, correspondence-analysis, data preprocessing, similarities, survey

I want to find 4-10 clusters in survey data with 100 questions answered by 2000 individuals using a technique such as K-means or Gaussian Mixture Models. There is no response variable, so the clustering technique needs to be unsupervised. The outcome of the clustering depends heavily on how the data is represented and preprocessed, and I have identified several potential issues.

The responses to the questions can be either ordinal with M possible values or categorical with N possible values. M will be either 3 or 10, whereas N can be anything in the interval [2,8]. All questions have been answered by all users.

Below are the issues that I would like to get input on:

  1. How should I measure the similarity between responses when I have these different kinds of variables? I would assume that the best way to input the ordinal information would be as a single variable/dimension, whereas a categorical variable would be represented using N variables/dimensions. Should I scale the variables in a particular way to give equal weight to each question, or is there some similarity measure I could use that fits this kind of data?

  2. The questions may be grouped into categories such as "Sociodemographic Information", "Communication Assessment", "Product Assessment" and "Brand Assessment". Suppose that "Communication Assessment" contains 80% of the questions, but we would still like a clustering result that uses an equal amount of information from each category. How should I preprocess the data to get this result?

  3. The surveys are sent out to a representative sample of the population. However, the responses are biased such that males are overrepresented and females are underrepresented. The difference is statistically significant. How should I include this information in the clustering process? Should I oversample users to get a representative dataset across gender, age, job type? The number of duplicate users created this way would probably be 10-15% of the total.

You should feel free to address other issues with clustering survey data that I have not identified.

Best Answer

A few suggestions:

Regarding #1, you have a mixture of scale types. K-means assumes continuous variables only, so it is not an appropriate technique for the raw data. Depending on the mixture of scale types, you may want to consider traditional PCA (for continuous data only) vs. an unsupervised approach that can integrate mixed scales, such as correspondence analysis or latent class PCA.
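If you prefer to work with a similarity measure directly, one option not covered above is a Gower-style dissimilarity, which scales every question to a contribution in [0, 1] before averaging. The following is only a minimal sketch: the function name, the toy respondents, and the choice to treat ordinal answers as equally spaced are all assumptions for illustration.

```python
def gower_dissimilarity(a, b, kinds, ranges):
    """Gower-style dissimilarity between two respondents.

    kinds[j] is "ordinal" or "categorical"; ranges[j] is max - min of
    question j for ordinal questions (ignored for categorical ones).
    Assumption: ordinal levels are treated as equally spaced.
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "ordinal":
            # Range-scaled absolute difference, always in [0, 1].
            total += abs(x - y) / rng
        else:
            # Simple matching: 0 if same category, 1 otherwise.
            total += 0.0 if x == y else 1.0
    # Averaging gives every question the same maximum contribution.
    return total / len(kinds)


# Hypothetical respondents: a 3-point ordinal item, a 10-point ordinal
# item, and a categorical item.
kinds = ["ordinal", "ordinal", "categorical"]
ranges = [2, 9, None]
alice = [1, 10, "red"]
bob = [3, 1, "red"]
carol = [1, 9, "blue"]

d_ab = gower_dissimilarity(alice, bob, kinds, ranges)
d_ac = gower_dissimilarity(alice, carol, kinds, ranges)
```

A full pairwise matrix built this way suits distance-based methods such as hierarchical clustering or k-medoids rather than K-means, which needs coordinates and means.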

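Latent class models are another approach to clustering. Standard latent class analysis is unsupervised and works directly with categorical indicators; only the regression variants presume the existence of a response or target variable, which you may not have available.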

Regarding #2, the "equal weighting" and preprocessing of the various question categories (e.g., communication): you can create a separate dimension for each category, either using judgement or by running a principal components analysis within each category to collapse and combine its questions.
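As a rough illustration of the judgement-based route, the sketch below collapses each category into a single standardized composite, so a category with many questions carries no more weight than a category with one. The averaging rule, the function names, and the toy data are assumptions; taking the first principal component within each category would be the PCA-based alternative. It only applies to numerically coded (e.g., ordinal) questions.

```python
from statistics import mean, pstdev


def zscores(column):
    """Standardize a list of numbers to mean 0 and standard deviation 1.

    Note: a constant column has zero spread and would raise
    ZeroDivisionError; drop such questions beforehand.
    """
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def block_scores(data, blocks):
    """Return one standardized composite per block for every respondent.

    data maps question name -> list of numeric responses;
    blocks maps block name -> list of question names in that block.
    """
    standardized = {q: zscores(col) for q, col in data.items()}
    n = len(next(iter(data.values())))
    scores = {}
    for block, questions in blocks.items():
        # Average the standardized questions within the block...
        averaged = [mean(standardized[q][i] for q in questions) for i in range(n)]
        # ...then re-standardize so every block has the same variance.
        scores[block] = zscores(averaged)
    return scores


# Hypothetical data: three communication questions, one brand question.
data = {
    "comm_1": [1, 2, 3, 2],
    "comm_2": [3, 3, 1, 1],
    "comm_3": [2, 1, 3, 2],
    "brand_1": [10, 4, 7, 1],
}
blocks = {
    "communication": ["comm_1", "comm_2", "comm_3"],
    "brand": ["brand_1"],
}
scores = block_scores(data, blocks)
```

Each respondent is then described by one number per category, and those columns can go into the clustering step on an equal footing.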

Regarding #3, to obtain results that are weighted and representative, I wouldn't oversample. Consider developing a weight factor instead. The general rule of thumb is not to weight the algorithms while factoring or clustering your data, but to weight the final results of those algorithms. There are many ways to weight your data; one approach is to use an IPF (iterative proportional fitting) model based on the marginals of each factor used.
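A minimal sketch of IPF on respondent-level weights might look like the following. The function name `rake`, the variable names, the toy sample, and the target shares are all hypothetical; real targets would come from population figures for the factors you weight on.

```python
def rake(respondents, targets, iterations=100, tol=1e-12):
    """Iterative proportional fitting of respondent weights.

    respondents: list of dicts, e.g. {"gender": "m", "age": "young"}.
    targets: {variable: {level: population share}}, shares sum to 1.
    Returns one weight per respondent; weights sum to len(respondents).
    """
    n = len(respondents)
    weights = [1.0] * n
    for _ in range(iterations):
        max_change = 0.0
        for var, shares in targets.items():
            # Current weighted total of each level of this variable.
            totals = {level: 0.0 for level in shares}
            for r, w in zip(respondents, weights):
                totals[r[var]] += w
            # Rescale so the weighted margin matches the target share.
            for i, r in enumerate(respondents):
                factor = shares[r[var]] * n / totals[r[var]]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        # Stop once a full pass over all variables changes nothing.
        if max_change < tol:
            break
    return weights


# Hypothetical sample in which males are overrepresented (7 of 10).
respondents = (
    [{"gender": "m", "age": "young"}] * 3
    + [{"gender": "m", "age": "old"}] * 4
    + [{"gender": "f", "age": "young"}] * 2
    + [{"gender": "f", "age": "old"}] * 1
)
# Hypothetical population shares for each margin.
targets = {
    "gender": {"m": 0.5, "f": 0.5},
    "age": {"young": 0.4, "old": 0.6},
}
weights = rake(respondents, targets)
```

In line with the rule of thumb above, the clustering itself would run unweighted, and the weights would only be applied afterwards, for example by reporting each cluster's share as the sum of its members' weights divided by the total weight.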