Solved – Clustering data that has a mixture of continuous and categorical variables

Tags: categorical-data, clustering, continuous-data, r

I have data that represent some aspect of human behavior. I want to cluster it (unsupervised) into behavioral profiles of some sort. Some of my variables are categorical (with 2 or more categories), and some are continuous (most are percentages). A few variables are more complex still: one category comes with further continuous data, while the other category has no such additional data.

My question is how to go about clustering this data. What are the (common?) approaches for dealing with it?

I don't need code or anything, but rather some references or directions that will help me further understand how to deal with this challenge.

If you know of R functions that facilitate such analysis, that would be great, but it's not necessary.

Thanks.

Best Answer

  1. Spend lots of time on understanding what similarity means for your data.
  2. Formalize your notion of similarity in a specialized similarity measure, designed for your particular data set (you will likely not be able to use an out-of-the-box similarity).
  3. Use a clustering algorithm that can work with arbitrary similarities, such as hierarchical clustering, DBSCAN, affinity propagation, or spectral clustering.
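
To make steps 2 and 3 concrete, here is a minimal sketch in Python, using only the standard library. It is an illustration, not a recommendation from the answer: it assumes a Gower-style dissimilarity (range-scaled absolute difference for numeric columns, simple mismatch for categorical ones) as a starting point and feeds the resulting matrix to a naive average-linkage hierarchical clustering. The function names, the column layout, and the toy data are all made up for the example. In line with step 2, you would probably replace `gower_distance` with a measure tailored to your own variables.

```python
from itertools import combinations

def gower_distance(a, b, kinds, ranges):
    """Gower-style dissimilarity between two records (0 = identical).

    kinds[i] is "num" or "cat"; ranges[i] is the observed range of a
    numeric column (unused for categorical columns).
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            # Scale by the column range so every variable contributes in [0, 1].
            total += abs(x - y) / rng if rng else 0.0
        else:
            # Categorical: 0 if the categories match, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(kinds)

def distance_matrix(rows, kinds):
    """Full symmetric matrix of pairwise dissimilarities."""
    ranges = []
    for j, kind in enumerate(kinds):
        if kind == "num":
            col = [r[j] for r in rows]
            ranges.append(max(col) - min(col))
        else:
            ranges.append(None)
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = gower_distance(rows[i], rows[j], kinds, ranges)
    return d

def average_linkage(d, k):
    """Merge the two closest clusters until k remain; returns lists of row indices."""
    clusters = [[i] for i in range(len(d))]

    def linkage(c1, c2):
        # Mean pairwise dissimilarity between the members of two clusters.
        return sum(d[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]  # q > p, so index p is unaffected
    return clusters

# Hypothetical toy data: (percentage, category, percentage)
rows = [
    (80.0, "mobile", 10.0),
    (75.0, "mobile", 15.0),
    (20.0, "desktop", 60.0),
    (25.0, "desktop", 55.0),
]
kinds = ["num", "cat", "num"]

d = distance_matrix(rows, kinds)
clusters = average_linkage(d, k=2)
print(clusters)
```

The clustering step only ever looks at the matrix `d`, so the dissimilarity can be swapped for any custom measure without touching the rest. Since the question asks about R: `daisy()` in the `cluster` package computes Gower dissimilarities for mixed-type data frames (`metric = "gower"`), and its output can be passed to `hclust()`.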