Solved – Why don’t dummy variables have the continuous adjacent-category problem in cluster analysis

binary data, categorical data, clustering

I know that if we use categorical variables coded as numbers in cluster analysis, we are implicitly assuming the scale is continuous, even though there is no meaningful concept of distance between two adjacent categories.
But what is the difference when you use dummy variables? The zeros and ones will still be used to calculate the distances in the cluster analysis. In a nutshell, why don't 0's and 1's have this same issue? Any references about it?
Thanks

Best Answer

If you transform the category attribute into a 0-1 vector, you are in fact measuring the distance as "same = 0, different = 1", with no intermediate values. It doesn't actually gain you much, but it is at least less misleading. I strongly advise checking your results and algorithms with respect to this, because e.g. k-means will also produce "means" which are not sensible for binary attributes.
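A minimal sketch of that caveat, assuming Python with numpy and scikit-learn (neither is part of the answer itself): one-hot encode a toy categorical column, run k-means, and observe that the resulting "means" are fractional rather than valid 0/1 indicators.

```python
import numpy as np
from sklearn.cluster import KMeans

colors = np.array(["red", "green", "blue", "red", "blue", "blue"])

# Dummy (one-hot) encoding: one 0/1 column per category.
categories = np.unique(colors)                      # ['blue', 'green', 'red']
X = (colors[:, None] == categories[None, :]).astype(float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
# The centroids contain values strictly between 0 and 1 (e.g. 0.33),
# which do not correspond to any actual category -- exactly the
# "means that are not sensible for binary attributes" mentioned above.
```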

It does less harm, because any two distinct categories end up equally different. Say you have three categories, "red", "green", "blue":

category  continuous    dummy
red           0         1 0 0
green         1         0 1 0
blue          2         0 0 1

When represented using a continuous variable, the distance "blue-red" is twice as large as "blue-green", so the algorithm will consider blue and red to be more different than blue and green. This does not happen with the dummy variables: there the distance is effectively binary, 0 for the same category and one constant value for any two different categories. You can achieve the same effect with a trivial categorical distance function

$$\text{dist}(c_1, c_2) = \begin{cases}0 & \text{if } c_1=c_2 \\ 1 & \text{otherwise}\end{cases}$$
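For concreteness, a small sketch (in Python with numpy, which the answer does not use itself) comparing the three codings on the example above. Euclidean distance on the dummy vectors is a constant $\sqrt{2}$ for any pair of distinct categories, so it matches the trivial categorical distance up to a scale factor, whereas the continuous coding distorts the distances.

```python
import numpy as np

categories = ["red", "green", "blue"]

# Dummy (one-hot) vectors, as in the table above.
dummy = {c: np.eye(len(categories))[i] for i, c in enumerate(categories)}

# Naive continuous coding: red=0, green=1, blue=2.
continuous = {c: float(i) for i, c in enumerate(categories)}

def cat_dist(c1, c2):
    """The trivial categorical distance: 0 if equal, 1 otherwise."""
    return 0.0 if c1 == c2 else 1.0

for a, b in [("blue", "green"), ("blue", "red")]:
    d_cont = abs(continuous[a] - continuous[b])       # 1.0 vs. 2.0
    d_dummy = np.linalg.norm(dummy[a] - dummy[b])     # sqrt(2) in both cases
    print(a, b, d_cont, d_dummy, cat_dist(a, b))
# Continuous coding makes blue-red twice as far apart as blue-green;
# the dummy coding and cat_dist treat all distinct pairs identically.
```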