Clustering numeric, categorical, and multivalue categorical data

Tags: categorical data, clustering, unsupervised learning

I have data that look like this:

amount    creator    accounts
100       john       cash, accounts payable
325       jane       accounts receivable, cash
200       john       tax account, accounts payable, cash

How should these data points be clustered?

Thoughts so far:

  • Popular, consensus answer seems to be to one-hot encode the categorical and multivalue_categorical fields, and then scale the numeric field to [0,1]. This causes two primary problems: extremely sparse/high-dimensional data (4,000 dimensions in my case), and a numeric column that is perhaps not weighted appropriately.

  • Attempt to apply differing algorithms to each data type and mash them together somehow. This could involve market-basket type analysis for the multivalue_categorical, k-modes for the categorical, and k-means for numeric (or k-prototypes for the categorical and numeric).

Is there any method/implementation that would allow for these three types of data to be clustered without one-hot encoding the categorical and multivalue categorical? I have looked into SOM as an unsupervised NN that performs clustering, but I haven't seen evidence that it can handle multivalue categorical.

Best Answer

With mixed data types the basic answer is to use Gower's distance (see @ttnphns' thorough explainer here: Hierarchical clustering with mixed type data - what distance/similarity to use?). The gist of it is that you choose the distance measure you prefer for each variable individually, compute those distances, then average them. You can also take a weighted average of the constituent distances, if you think some should be given more weight than others.
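
To make the averaging concrete, here is one way to write it down. The notation is my own (not from the linked explainer), and it ignores missing values: $d_k(i,j) \in [0,1]$ is the normalized distance between rows $i$ and $j$ on variable $k$, and $w_k \ge 0$ is the weight you assign to that variable.

$$
d(i,j) = \frac{\sum_{k=1}^{p} w_k \, d_k(i,j)}{\sum_{k=1}^{p} w_k}
$$

Setting every $w_k = 1$ gives the plain average.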

For your continuous variable, the absolute difference should be fine. Simple matching is presumably fine for your categorical variable, creator. That is, the similarity is $1$ if two rows have the same creator, and $0$ otherwise; the corresponding distance is one minus that, so rows with the same creator are at distance $0$. Then you just need to find a metric for your multivalue categorical variable. You could treat this as a single variable, but I suspect it is ultimately better to think of it as a set of binary variables, one for each possible option. If an option is listed for a row, that row has a $1$ in the corresponding column, and a $0$ otherwise. Thus, you have a high-dimensional binary space. Lots of measures have been defined for binary data (see: Choi, Cha, & Tappert, A Survey of Binary Similarity and Distance Measures, pdf, for a list of 76!). You need to decide which makes sense. Each constituent distance measure gets normalized, and then you use whichever clustering algorithm you like that can work with a distance matrix instead of the raw data (see, e.g., my answer here: How to use both binary and continuous variables together in clustering?).
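
As a rough illustration of the recipe above, here is a minimal pure-Python sketch on the three rows from the question. Several choices in it are my assumptions rather than part of the answer: the absolute difference is normalized by dividing by the range of `amount`, the Jaccard distance is used for the multivalue field (only one of the many binary measures you could pick), and the helper names `jaccard_distance` and `gower_matrix` are made up for this example.

```python
# Rows from the question: (amount, creator, accounts).
# Accounts are stored as sets, which is equivalent to one binary
# column per possible account without materializing those columns.
rows = [
    (100, "john", {"cash", "accounts payable"}),
    (325, "jane", {"accounts receivable", "cash"}),
    (200, "john", {"tax account", "accounts payable", "cash"}),
]


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| (0 when both sets are empty).

    Assumption: Jaccard is just one candidate binary measure.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def gower_matrix(rows, weights=(1.0, 1.0, 1.0)):
    """Weighted average of per-variable distances, each in [0, 1]."""
    amounts = [r[0] for r in rows]
    # Assumption: normalize the numeric distance by the observed range.
    amount_range = (max(amounts) - min(amounts)) or 1.0
    w_amount, w_creator, w_accounts = weights
    total_weight = w_amount + w_creator + w_accounts

    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d_amount = abs(rows[i][0] - rows[j][0]) / amount_range
            # Simple matching as a distance: 0 if same creator, else 1.
            d_creator = 0.0 if rows[i][1] == rows[j][1] else 1.0
            d_accounts = jaccard_distance(rows[i][2], rows[j][2])
            d = (w_amount * d_amount
                 + w_creator * d_creator
                 + w_accounts * d_accounts) / total_weight
            dist[i][j] = dist[j][i] = d
    return dist


D = gower_matrix(rows)
for row in D:
    print([round(x, 3) for x in row])
```

With equal weights, the two john rows come out closest to each other. The resulting square matrix can then be handed to any method that accepts precomputed distances; for example, after converting it to condensed form with `scipy.spatial.distance.squareform`, it could be passed to `scipy.cluster.hierarchy.linkage`. Because the sets are compared directly, the 4,000-column one-hot matrix never has to be built.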
