Clustering – How to Determine Optimal Number of Clusters

clustering

For a multilabel dataset, I would like to find the number of clusters it contains. The example below gives more details about the problem:

Label_A: feature values
Label_B: feature values
Label_A, Label_C: feature values 
Label_C: feature values ... etc

Say we have $n$ data records. The label field of a record may hold a single label or several labels (as in record 3).

I would like to determine the number of clusters in the data. Assuming that the number of clusters equals the number of labels results in poor accuracy, because a single label may span multiple clusters. In that case, if we can find more clusters and assign two or more clusters to the same label, we can increase the accuracy.

Hence, how do you find the number of clusters present in multilabel data?

Best Answer

You can convert the labels into features indicating whether each label is present or not. After that, you can use various clustering algorithms and their corresponding methods to find the number of clusters.
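As a rough illustration of the label-to-feature conversion, here is a minimal sketch in plain Python. The function name `encode_labels`, the `(labels, features)` record layout, and the toy values are all assumptions made for the example; they are not taken from the question's actual data.

```python
def encode_labels(records):
    """Turn each record's label set into 0/1 indicator columns.

    `records` is a list of (labels, features) pairs. Returns the sorted
    label vocabulary and one row per record: indicators, then features.
    """
    vocabulary = sorted({label for labels, _ in records for label in labels})
    rows = []
    for labels, features in records:
        # One column per known label: 1.0 if the record carries it, else 0.0.
        indicators = [1.0 if label in labels else 0.0 for label in vocabulary]
        rows.append(indicators + list(features))
    return vocabulary, rows


# Hypothetical toy records mirroring the layout in the question.
records = [
    ({"Label_A"}, [0.1, 0.2]),
    ({"Label_B"}, [0.9, 0.8]),
    ({"Label_A", "Label_C"}, [0.4, 0.5]),
    ({"Label_C"}, [0.6, 0.7]),
]
vocabulary, rows = encode_labels(records)
```

The resulting `rows` can be passed to whichever clustering algorithm you choose. Whether the indicator columns need rescaling relative to the original features depends on your data.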

EDIT: I understood that your difficulty was handling the multiple labels, and I suggested a solution for that. Your question did not mention that you wanted to use the k-means algorithm. The question of how many clusters to use with k-means has been answered here: "How to define number of clusters in K-means clustering?". For hierarchical clustering, the answer is here: "Where to cut a dendrogram?". There are also many other clustering methods available; see "Choosing a clustering method".
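For k-means specifically, one common heuristic (not necessarily the one recommended in the linked answers) is to run the algorithm for several values of $k$ and keep the one with the highest mean silhouette score. The sketch below is a self-contained, plain-Python illustration on made-up 2-D points. The helper names `dist`, `kmeans`, and `silhouette`, the farthest-point initialisation, and the candidate range 2 to 5 are assumptions chosen for the example.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from its nearest existing centre.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    assignment = None
    for _ in range(iterations):
        new_assignment = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centre if a cluster went empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def silhouette(points, assignment):
    """Mean silhouette coefficient; higher means tighter, better-separated clusters."""
    clusters = {}
    for p, a in zip(points, assignment):
        clusters.setdefault(a, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    scores = []
    for p, a in zip(points, assignment):
        own = clusters[a]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # Mean distance to the other members of the point's own cluster.
        cohesion = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Mean distance to the nearest other cluster.
        separation = min(
            sum(dist(p, q) for q in other) / len(other)
            for label, other in clusters.items()
            if label != a
        )
        scores.append((separation - cohesion) / max(cohesion, separation))
    return sum(scores) / len(scores)


# Made-up data: three well-separated groups of three points each.
points = [
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [5.0, 5.0], [5.2, 5.1], [5.1, 4.8],
    [0.0, 5.0], [0.2, 5.2], [0.1, 4.9],
]
best_k = max(range(2, 6), key=lambda k: silhouette(points, kmeans(points, k)))
```

On real data the silhouette curve is often much flatter than on this toy example, so it is worth inspecting the scores for all candidate values rather than trusting the maximum alone.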
