Solved – Combining two, three, or n metrics to calculate a dissimilarity matrix

clustering, machine learning, r

I have a data set with 9000 instances and 40 attributes of mixed type, that is, categorical and numeric. My goal is to group the instances into clusters using whichever clustering algorithm works best. I've read that Gower distance is suitable for such a data set. My question is: can I combine two (or n) metrics when calculating distances between instances? For example, I would like to use Euclidean distance on the numeric attributes and Gower distance on the categorical attributes. I could always split my data set in two, one part with the numeric attributes and the other with the categorical ones, and compute a distance on each. But how would one interpret each result? Summing them up just sounds … wrong.

My second question is what exactly does Gower distance do with numeric values? Does my first question even make sense?

Here is a snippet of my code. I am using R and the functions daisy and agnes from the package cluster:

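library(cluster)

# Gower dissimilarity on the mixed-type data frame
df.diss <- daisy(df, metric = "gower", type = list(numeric = c(1, 4, 6, 8, 9, 11, 12, 13, 14, 17:37), symm = c(2, 3, 5, 7)), stand = FALSE)
# Agglomerative hierarchical clustering on the dissimilarities
df.clust <- agnes(df.diss)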

Using these functions or even R is not a must.

Best Answer

If you look at Gower in detail, you'll notice that on numerical attributes it uses Manhattan distance (the absolute difference, scaled by each attribute's range). You can easily modify it to use Euclidean instead.
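For reference, here is a sketch of the usual form of Gower's dissimilarity, ignoring missing values. The symbols $d_i$ and $R_i$ are introduced here for illustration: $d_i$ is the per-attribute contribution and $R_i$ is the range (maximum minus minimum) of numerical attribute $i$ in the data.

$$ d_{\text{Gower}}(x,y) = \frac{\sum_{i} \omega_i \, d_i(x_i,y_i)}{\sum_{i} \omega_i}, \qquad d_i(x_i,y_i) = \begin{cases} \dfrac{|x_i-y_i|}{R_i} & i \in \text{numerical} \\[1ex] \mathbb{1}_{x_i \neq y_i} & i \in \text{categorical} \end{cases} $$

With all $\omega_i = 1$, each numerical attribute contributes a value in $[0,1]$ and each categorical attribute contributes 0 or 1, so the numerical part is a range-scaled Manhattan distance averaged together with the categorical mismatches.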

However, feature weighting will have a major impact on the results. There are a few approaches for supervised weighting of features IIRC, but I have not yet seen anything reliable for automatic weighting that does not require labels.

So in the end, you will have the problem that your distance function looks something like this (for the Euclidean case):

$$ d(x,y) = \sqrt{\sum_{i\in \text{numerical}} \omega_i(x_i-y_i)^2} + \sum_{i \in \text{categorical}} \omega_i \mathbb{1}_{x_i \neq y_i} $$

where you will face the challenge of choosing all the $\omega_i$ weights.
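A minimal base R sketch of a combined distance of this kind, weighted Euclidean on the numeric columns plus a weighted mismatch count on the categorical ones, might look as follows. The function name `combined_dist`, its arguments, and the toy data are hypothetical. The optional range rescaling of numeric columns is an added assumption (so that units do not dominate), and missing values are not handled.

```r
# Hypothetical helper: weighted Euclidean distance on numeric columns plus
# a weighted count of mismatches on categorical columns.
combined_dist <- function(df, num_cols, cat_cols,
                          w_num = rep(1, length(num_cols)),
                          w_cat = rep(1, length(cat_cols)),
                          rescale = TRUE) {
  n <- nrow(df)
  num <- as.matrix(df[, num_cols, drop = FALSE])

  if (rescale) {
    # Assumption: map each numeric column to [0, 1] by its range.
    rng <- apply(num, 2, function(v) diff(range(v)))
    rng[rng == 0] <- 1  # constant columns stay at zero
    num <- sweep(num, 2, apply(num, 2, min), "-")
    num <- sweep(num, 2, rng, "/")
  }

  # sqrt(sum_i w_i (x_i - y_i)^2) equals the plain Euclidean distance
  # after multiplying column i by sqrt(w_i).
  d_num <- as.matrix(dist(sweep(num, 2, sqrt(w_num), "*")))

  # Each categorical column adds its weight wherever two rows differ.
  d_cat <- matrix(0, n, n)
  for (j in seq_along(cat_cols)) {
    v <- as.character(df[[cat_cols[j]]])
    d_cat <- d_cat + w_cat[j] * outer(v, v, "!=")
  }

  as.dist(d_num + d_cat)
}

# Toy data, invented for illustration only
toy <- data.frame(
  height = c(150, 160, 170, 180),
  weight = c(50, 60, 70, 90),
  colour = c("red", "red", "blue", "blue")
)

d  <- combined_dist(toy, num_cols = c("height", "weight"), cat_cols = "colour")
hc <- hclust(d, method = "average")
```

The result is an ordinary `dist` object, so it can be passed to `hclust` as above or to `agnes` from the cluster package. The sketch builds full n-by-n matrices, which for 9000 instances means several hundred megabytes each. The choice of `w_num` and `w_cat` is still left to you, which is exactly the difficulty described above.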