Solved – Which unsupervised classification method can be used for categorical data

categorical data, clustering, unsupervised learning

I have a list of categorical data and I want to apply an unsupervised classification method to cluster this data.
Which method could be used?

Example:

gene1
gene2
gene3
gene4
gene5
gene6
gene7
gene8

My goal is to cluster these genes.

Could anyone suggest how to cluster these labels?

Best Answer

I'm going to answer this as an approach to clustering categorical data.

Standard k-means performs poorly on categorical data because the sample space is discrete. The k-means cost function uses the Euclidean distance (or something similar), which is meaningful only for continuous variables. Instead of the Euclidean distance, one could use the Hamming distance (for categorical data) or the Gower distance (for mixed data). Instead of computing the mean, one can compute the mode: the most frequent value of each nominal variable is used as its representative (the cluster center). This cost function is used in a variation of k-means called k-modes, in which modes play the role that centroids play in k-means.
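To make the idea concrete, here is a minimal sketch of the k-modes loop in plain Python: assign each record to the nearest mode by Hamming distance, then recompute each mode column by column. The function names (`hamming`, `column_modes`, `k_modes`) and the toy records are made up for illustration, and the random initialisation is a simplification I'm assuming here, not necessarily what any particular library or paper does.

```python
from collections import Counter
import random


def hamming(a, b):
    """Number of positions at which two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent value in each column of a non-empty list of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, k, n_iter=20, seed=0):
    """Toy k-modes: assign by Hamming distance, update centers by column mode."""
    rng = random.Random(seed)
    # Assumption: start from k distinct records picked at random.
    modes = rng.sample(sorted(set(data)), k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest mode under Hamming distance.
        new_labels = [
            min(range(k), key=lambda j: hamming(row, modes[j])) for row in data
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: replace each mode with the per-column mode of its members.
        for j in range(k):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical records with three categorical attributes each.
data = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("blue", "medium", "square"),
]
labels, modes = k_modes(data, k=2)
print(labels)
print(modes)
```

Note that each record needs at least one categorical attribute to compare on; a bare list of names by itself gives the distance nothing to work with. The result also depends on the starting modes, so in practice one would probably run it with several seeds and keep the clustering with the lowest total Hamming cost.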

If you are using Python, you could use this package. The method was first presented in this paper. You can read about its usage in the package documentation. There is another extension of k-modes called k-prototypes, which works well for mixed data types (it is included in the Python package).
