Solved – Why don’t dummy variables have the continuous adjacent-category problem in cluster analysis

binary data, categorical data, clustering

I know that if we use categorical variables coded as numbers in cluster analysis, we are implicitly assuming the scale is continuous, even though there is no meaningful concept of distance between two adjacent categories.
But what is the difference when you use dummy variables? The zeros and ones will still be used to calculate the distances in the cluster analysis. In a nutshell, why don't 0's and 1's have this same issue? Any references about it?
Thanks

Best Answer

If you transform the category attribute into a 0-1 vector, you are in fact measuring the distance as "same = 0, different = 1", with no intermediate values. It doesn't actually gain you much, but it is at least less misleading. I strongly advise checking your results and algorithms with respect to this, because e.g. k-means will also produce "means" which are not sensible for binary attributes.
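A minimal sketch of that caveat, assuming Python with numpy and scikit-learn (neither is part of the answer itself): one-hot encode a toy categorical column, run k-means, and observe that the resulting "means" are fractional rather than valid 0/1 indicators.

```python
import numpy as np
from sklearn.cluster import KMeans

colors = np.array(["red", "green", "blue", "red", "blue", "blue"])

# Dummy (one-hot) encoding: one 0/1 column per category.
categories = np.unique(colors)                      # ['blue', 'green', 'red']
X = (colors[:, None] == categories[None, :]).astype(float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
# The centroids contain values strictly between 0 and 1 (e.g. 0.33),
# which do not correspond to any actual category -- exactly the
# "means that are not sensible for binary attributes" mentioned above.
```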

It does less harm, because any two distinct categories end up equally different. Say you have three categories, "red", "green", "blue":

category  continuous    dummy
red           0         1 0 0
green         1         0 1 0
blue          2         0 0 1

When represented using a continuous variable, the distance "blue-red" is twice as large as "blue-green", so the algorithm will consider blue and red to be more different than blue and green. This does not happen with the dummy variables: there the distance is effectively binary, 0 for the same category and one constant value for any two different categories. You can achieve the same effect with a trivial categorical distance function

$$\text{dist}(c_1, c_2) = \begin{cases}0 & \text{if } c_1=c_2 \\ 1 & \text{otherwise}\end{cases}$$
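For concreteness, a small sketch (in Python with numpy, which the answer does not use itself) comparing the three codings on the example above. Euclidean distance on the dummy vectors is a constant $\sqrt{2}$ for any pair of distinct categories, so it matches the trivial categorical distance up to a scale factor, whereas the continuous coding distorts the distances.

```python
import numpy as np

categories = ["red", "green", "blue"]

# Dummy (one-hot) vectors, as in the table above.
dummy = {c: np.eye(len(categories))[i] for i, c in enumerate(categories)}

# Naive continuous coding: red=0, green=1, blue=2.
continuous = {c: float(i) for i, c in enumerate(categories)}

def cat_dist(c1, c2):
    """The trivial categorical distance: 0 if equal, 1 otherwise."""
    return 0.0 if c1 == c2 else 1.0

for a, b in [("blue", "green"), ("blue", "red")]:
    d_cont = abs(continuous[a] - continuous[b])       # 1.0 vs. 2.0
    d_dummy = np.linalg.norm(dummy[a] - dummy[b])     # sqrt(2) in both cases
    print(a, b, d_cont, d_dummy, cat_dist(a, b))
# Continuous coding makes blue-red twice as far apart as blue-green;
# the dummy coding and cat_dist treat all distinct pairs identically.
```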