Solved – Using the self-organizing map for sequences of categorical data

categorical dataclusteringself organizing mapssvmunsupervised learning

I have a number of vectors of categorical data (ex. {'re','ty','cf', …} ) and I want to perform an unsupervised learning on them. I came across self-organizing map technique and I am trying to see if I can apply this method on my data. The vectors that I have are not similar in size and I was thinking if the self-organizing map can be applied on my data. Is there any version of this method that can do so?

Best Answer

I will make an assumption here about your data and give you a possible answer that you might use to get started on your problem. The assumption is that your vectors of categorical data indicate, for each data point, the categories (or you might call them features) that are relevant for that data point.

If this is the case, then here's my suggestion, which uses techniques from natural language processing for inspiration.

First, go over your entire dataset and compute the set of $k$ unique features that are present in any of your data points. This will give you a (possibly very large) "vocabulary" $V = \{v_1, \dots, v_k\}$ where $v_i$ is a feature like 're', 'cf', 'ty', etc. Then, for each of your data points $p$, compute a binary feature vector $f_p = <\mathbb{1}[v_1\in p], \mathbb{1}[v_2\in p], \dots, \mathbb{1}[v_k\in p]>$. (Here, the indicator function $\mathbb{1}[x]$ is 0 if $x$ is false and 1 otherwise.) That is, convert data point $p$ into a vector of $k$ 0s and 1s such that the $i$th element of that vector indicates whether feature $v_i$ is present for data point $p$.

Once you've normalized the lengths of your data points like this, you can use a SOM with a Hamming distance (or some other bitwise distance) to do clustering with your data.