Solved – Can t-SNE be directly used as a clustering algorithm

I've been working on a dataset of a few million commuters and their travel patterns, with around 50 dimensions, and I'm applying t-SNE to it.

My initial goal in applying t-SNE was to visually confirm the clustering potential of the dataset, i.e. the presence of natural structure in it. Fortunately, t-SNE consistently outputs 3 distinct clusters in all runs. Even better, the t-SNE clusters make business sense when I map the original commuter characteristics back onto them for interpretation.

But when I researched the applications of t-SNE, I found that few people seem to discuss using it as a clustering algorithm (of course, the clusters would need to be assigned by a human annotator looking at the t-SNE plots). By contrast, there are many articles about how to apply t-SNE for data visualization and dimensionality reduction.

A discussion in "Clustering on the output of t-SNE" dismissed t-SNE as lacking 'a concise, intuitive description of what objective the cluster assignment algorithm minimizes'. Can someone please elaborate on this comment, and on whether it is right to apply t-SNE as a clustering algorithm?

Best Answer

There are several reasons why t-SNE is not used as a clustering algorithm.

First, as you point out yourself, t-SNE does not generate any cluster assignments. Instead, it performs dimensionality reduction, embedding the data into a low-dimensional space that is easy to visualize. You could, of course, run a standard clustering algorithm such as k-means on this embedding to get clusters. However, if the clusters exist in the data, you should not need to map it to 2D first.

Secondly, and this is quite crucial, t-SNE may create an embedding containing clusters that don't really exist in the real data. It may also disregard clusters that do exist in the real data. Depending on the randomness in the algorithm and the chosen hyperparameters, you may get very different results. I recommend reading the article "How to Use t-SNE Effectively", which discusses some of these unexpected scenarios. Another, related issue is the reproducibility of t-SNE results.
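To make the sensitivity to randomness and hyperparameters concrete, here is a minimal sketch using scikit-learn. It is only an illustration: the data is a synthetic stand-in generated with `make_blobs` (not the commuter dataset), and the helper `embed`, the perplexity values, and the seeds are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for the real data (an assumption for illustration):
# 150 points in 50 dimensions, drawn from 3 blobs.
X, _ = make_blobs(n_samples=150, n_features=50, centers=3, random_state=0)


def embed(X, perplexity, seed):
    """Run t-SNE once with an explicit perplexity and random seed."""
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        init="random",        # random initialization, so the seed matters
        random_state=seed,
    )
    return tsne.fit_transform(X)


# One embedding per (perplexity, seed) combination.
embeddings = {
    (perplexity, seed): embed(X, perplexity, seed)
    for perplexity in (5, 30)
    for seed in (0, 1)
}

# Same data and same perplexity, but a different seed:
# the 2D coordinates are not the same.
same_layout = np.allclose(embeddings[(30, 0)], embeddings[(30, 1)])
print("identical layout across seeds:", same_layout)
```

Plotting the four embeddings side by side is a quick way to see which visual groupings persist across settings and which are artifacts of one particular run.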

Finally, if there are real clusters in your data, you should be able to find them using standard, well-understood clustering algorithms. Using methods that people understand gives way more credibility to your results and makes them easier to interpret.
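As a rough sketch of that alternative, the snippet below runs k-means directly on the original 50-dimensional features and checks the result in that same space, using scikit-learn. The data is again a synthetic stand-in from `make_blobs`, and the choice of k = 3, the seeds, and the two checks (silhouette score, and agreement between two independently seeded runs via the adjusted Rand index) are assumptions made for illustration, not a prescribed workflow.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for the real data (an assumption for illustration):
# 300 points in 50 dimensions, drawn from 3 blobs.
X, _ = make_blobs(n_samples=300, n_features=50, centers=3, random_state=0)

# Cluster in the original feature space, with two different seeds.
labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Silhouette score, computed in the original space: values near 1 mean
# compact, well-separated clusters; values near 0 mean overlapping ones.
silhouette = silhouette_score(X, labels_a)

# Adjusted Rand index between the two runs: 1.0 means both runs produced
# the same partition (up to a relabeling of the clusters).
agreement = adjusted_rand_score(labels_a, labels_b)

print(f"silhouette: {silhouette:.2f}, agreement between runs: {agreement:.2f}")
```

If the clusters seen in the t-SNE plot are real, checks like these should also come out well on the original features; real data will usually need feature scaling first, which this sketch skips.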
