Solved – Why are mixed data a problem for Euclidean-based clustering algorithms

Tags: clustering, dimensionality reduction, distance, mixed type data, self-organizing maps

Most classical clustering and dimensionality reduction algorithms (hierarchical clustering, principal component analysis, k-means, self-organizing maps…) are designed specifically for numeric data, and their input data are seen as points in a Euclidean space.

This is a problem of course, as many real-world questions involve mixed data: for instance, if we study buses, the height, length, and motor size will be numbers, but we might also be interested in color (categorical variable: blue/red/green…) and capacity class (ordinal variable: small/medium/large capacity). Specifically, we might want to study these different types of variables simultaneously.

There are a number of methods to extend classical clustering algos to mixed data, for instance using a Gower dissimilarity as input to hierarchical clustering, multidimensional scaling, or any other method that takes a distance matrix. Or, for instance, this method, an extension of SOM to mixed data.
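For concreteness, here is a minimal sketch of the Gower-plus-hierarchical route, with a hand-rolled Gower computation (the toy bus data and the equal per-variable weights are invented for illustration; in practice you would use a library implementation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy mixed data: numeric (length), categorical (color), ordinal (capacity).
length = np.array([10.5, 12.0, 9.0, 18.0])           # numeric, in meters
color = np.array(["blue", "red", "blue", "green"])    # categorical
capacity = np.array([1, 2, 1, 3])                     # ordinal, coded 1 < 2 < 3

n = len(length)
gower = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # Numeric variable: absolute difference scaled by the range.
        d_num = abs(length[i] - length[j]) / (length.max() - length.min())
        # Categorical variable: simple mismatch (0 if equal, 1 otherwise).
        d_cat = float(color[i] != color[j])
        # Ordinal variable: treated as ranks, range-normalized difference.
        d_ord = abs(capacity[i] - capacity[j]) / (capacity.max() - capacity.min())
        gower[i, j] = (d_num + d_cat + d_ord) / 3     # unweighted average

# Hierarchical clustering on the precomputed dissimilarity matrix.
Z = linkage(squareform(gower), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))
```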

My question is: why can't we just use the Euclidean distance on mixed variables? Or why is it bad to do so? Why can't we just dummy-encode the categorical variables, normalize all variables so that they carry similar weight in the distance between observations, and run the usual algos on the resulting matrices?
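For concreteness, that recipe would look something like this (a sketch with scikit-learn; the toy data are invented):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans

# Invented toy bus data.
df = pd.DataFrame({
    "height": [3.1, 3.3, 2.9, 4.0],
    "length": [10.5, 12.0, 9.0, 18.0],
    "color": ["blue", "red", "blue", "green"],
})

# Dummy-encode the categorical column, standardize the numeric ones.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["height", "length"]),
    ("cat", OneHotEncoder(), ["color"]),
])
X = pre.fit_transform(df)

# Run plain k-means on the resulting all-numeric matrix.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```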

It's really easy, yet never done, so I suppose it's very wrong, but can anyone tell me why? And/or give me some refs? Thanks

Best Answer

It's not about not being able to compute something.

Distances must be used to measure something meaningful. This will fail much earlier with categorical data. If it ever works with more than one variable, that is...

If you have the attributes shoe size and body mass, Euclidean distance doesn't make much sense either. It works when x, y, and z are physical distances: then Euclidean distance is the line-of-sight distance between the points.
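A quick numeric illustration (the numbers are invented): merely changing the unit of one attribute reorders who is nearest to whom.

```python
import numpy as np

# Three people: column 0 = shoe size (EU), column 1 = body mass.
# Mass measured in kilograms:
people_kg = np.array([[42.0, 70.0],    # person 0
                      [42.5, 80.0],    # person 1: similar feet, heavier
                      [47.0, 71.0]])   # person 2: big feet, similar mass
# The same people, mass measured in tonnes:
people_t = people_kg * np.array([1.0, 0.001])

def nearest_to_person0(X):
    d = np.linalg.norm(X[1:] - X[0], axis=1)   # Euclidean distances to person 0
    return 1 + int(np.argmin(d))

print(nearest_to_person0(people_kg))  # -> 2 (mass differences dominate)
print(nearest_to_person0(people_t))   # -> 1 (shoe size differences dominate)
```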

Now if you dummy-encode variables, what meaning does this yield?

Plus, Euclidean distance doesn't make sense when your data is discrete.

If only integer x and y values exist, Euclidean distance will still yield non-integer distances, which don't map back to the data. Similarly, for dummy-encoded variables, the distance will not map back to a count of mismatched dummy variables...
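Concretely, under one-hot encoding every pair of distinct categories sits at exactly √2 from each other, so the distance says nothing about which categories differ and is never a count of anything:

```python
import numpy as np

# One-hot encodings of three colors.
blue  = np.array([1, 0, 0])
red   = np.array([0, 1, 0])
green = np.array([0, 0, 1])

# Every pair of distinct categories is exactly sqrt(2) apart:
print(np.linalg.norm(blue - red))    # 1.4142...
print(np.linalg.norm(blue - green))  # 1.4142...
print(np.linalg.norm(red - green))   # 1.4142...
# The value sqrt(2) does not correspond to anything in the original data.
```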

When you then plan to use, e.g., k-means clustering, it isn't just about distances but also about computing the mean. And there is no reasonable mean of dummy-encoded variables, is there?
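For instance, the "mean color" of a cluster is just a vector of class proportions (same toy encoding as above):

```python
import numpy as np

# Cluster of three buses, one-hot colors: blue, blue, red.
cluster = np.array([[1, 0, 0],
                    [1, 0, 0],
                    [0, 1, 0]])

centroid = cluster.mean(axis=0)
print(centroid)  # [0.6667 0.3333 0.    ] -- "two thirds blue, one third red"
# This centroid is not a valid color; it is a vector of class proportions,
# and no observation can ever equal it.
```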

Finally, there is the curse of dimensionality. Euclidean distance is known to degrade as the number of variables grows, and adding dummy-encoded variables means you lose distance contrast quite fast: everything becomes about as similar as everything else, because a single dummy variable can make all the difference.
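A quick simulation of that contrast loss (a sketch on uniform random data; the sample size and dimensions are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                       # 500 uniform random points
    dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances to the first point
    # Relative contrast: how much farther the farthest point is than the nearest.
    # This ratio shrinks toward 0 as d grows.
    print(d, (dists.max() - dists.min()) / dists.min())
```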
