Cluster analysis on weighted survey data with continuous and categorical variables

categorical data, clustering, mixed type data, weighted-sampling

I am trying to perform cluster analysis on survey data where each respondent has answered several questions, some of which have categorical answers ("blue", "pink", "green", etc.) and some of which have scale answers (a rating from 1 to 10, etc.).

My problem is that certain age groups were over-sampled and I need to weight the data collected in order to accurately reflect the current population.

Will it make a difference if I do the cluster analysis on the weighted data, and if so, how do I do cluster analysis on the weighted data?

Any advice would be much appreciated!

Thanks
Emma

Best Answer

Some clustering algorithms can use case weights. At least the "average" (also called UPGMA) and "Ward" methods can. If weights are available, you should use them to get unbiased results. In R, you can specify weights using the members argument of the "hclust" function (in base R). The WeightedCluster library also provides some functions (such as partitioning around medoids, PAM, and clustering quality measures) for clustering weighted data.

You can mix different types of variables (e.g. nominal, metric, ...) using the "gower" distance. In R, this distance is available in the "cluster" library through the "daisy" function.

daisy(..., metric="gower")
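
To make the two steps concrete, here is a minimal sketch that combines the Gower distance with weighted hierarchical clustering. The data frame, its column names and the weight values are invented for illustration and are not from the question. It also assumes that passing case weights through members is acceptable for your design, as suggested above.

```r
library(cluster)  # provides daisy()

# Hypothetical toy survey: one nominal answer and one 1-10 rating.
survey <- data.frame(
  colour = factor(c("blue", "pink", "green", "blue", "pink", "green")),
  rating = c(2, 9, 5, 3, 8, 6)
)

# Assumed case weights (e.g. from post-stratification on age group).
w <- c(0.5, 0.5, 1.0, 1.5, 2.0, 0.5)

# Gower dissimilarity handles the factor and the numeric column together.
d <- daisy(survey, metric = "gower")

# 'members' tells hclust how many observations each row stands for;
# this is how the weights enter the "average" (UPGMA) linkage updates.
hc <- hclust(d, method = "average", members = w)

# Cut the tree into two clusters and report the weighted size of each.
groups <- cutree(hc, k = 2)
weighted_sizes <- tapply(w, groups, sum)
print(weighted_sizes)
```

The weighted sizes, rather than the raw counts, are what should be compared with the population when the sample was drawn with unequal probabilities.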

You can get more information about these commands by running:

?daisy
?hclust