Solved – Clustering on the output of t-SNE

clusteringinterpretationk-meanstsne

I've got an application where it'd be handy to cluster a noisy dataset before looking for subgroup effects within the clusters. I first looked at PCA, but it takes ~30 components to get to 90% of the variability, so clustering on just a couple of PC's will throw away a lot of information.

I then tried t-SNE (for the first time), which gives me an odd shape in two dimensions that is very amenable to clustering via k-means. What's more, running a random forest on the data with the cluster assignment as the outcome shows that the clusters have a fairly sensible interpretation given the context of the problem, in terms of the variables that make up the raw data.

But if I'm going to report on these clusters, how do I describe them? K-means clusters on principal components reveal individuals who are nearby to one another in terms of the derived variables that comprise X% of the variance in the dataset. What equivalent statement can be made about t-SNE clusters?

Perhaps something to the effect of:

t-SNE reveals approximate contiguity in an underlying high-dimensional manifold, so clusters on the low-dimensional representation of the high-dimensional space maximize the "likelihood" that contiguous individuals will not be in the same cluster

Can anyone propose a better blurb than that?

Best Answer

The problem with t-SNE is that it does not preserve distances nor density. It only to some extent preserves nearest-neighbors. The difference is subtle, but affects any density- or distance based algorithm.

While clustering after t-SNE will sometimes (often?) work, you will never know whether the "clusters" you find are real, or just artifacts of t-SNE. You will not be able to explain the clusters. You may just be seeing 'shapes in clouds'.

To see this effect, simply generate a multivariate Gaussian distribution. If you visualize this, you will have a ball that is dense and gets much less dense outwards, with some outliers that can be really far away.

Now run t-SNE on this data. You will usually get a circle of rather uniform density. If you use a low perplexity, it may even have some odd patterns in there. But you cannot really tell apart outliers anymore.

Now lets make things more complicated. Let's use 250 points in a normal distribution at (-2,0), and 750 points in a normal distribution at (+2,0).

Input data

This is supposed to be an easy data set, for example with EM:

EM clustering

If we run t-SNE with default perplexity of 40, we get an oddly shaped pattern:

t-SNE p=40

Not bad, but also not that easy to cluster, is it? You will have a hard time finding a clustering algorithm that works here exactly as desired. And even if you would ask humans to cluster this data, most likely they will find much more than 2 clusters here.

If we run t-SNE with a too small perplexity such as 20, we get more of these patterns that do not exist:

t-SNE p=20

This will cluster e.g. with DBSCAN, but it will yield four clusters. So beware, t-SNE can produce "fake" patterns!

The optimum perplexity appears to be somewhere around 80 for this data set; but I don't think this parameter should work for every other data set.

t-SNE p=80

Now this is visually pleasing, but not better for analysis. A human annotator could likely select a cut and get a decent result; k-means however will fail even in this very very easy scenario! You can already see that density information is lost, all data seems to live in area of almost the same density. If we would instead further increase the perplexity, the uniformity would increase, and the separation would reduce again.

In conclusions, use t-SNE for visualization (and try different parameters to get something visually pleasing!), but rather do not run clustering afterwards, in particular do not use distance- or density based algorithms, as this information was intentionally (!) lost. Neighborhood-graph based approaches may be fine, but then you don't need to first run t-SNE beforehand, just use the neighbors immediately (because t-SNE tries to keep this nn-graph largely intact).

More examples

These examples were prepared for the presentation of the paper (but cannot be found in the paper yet, as I did this experiment later)

Erich Schubert, and Michael Gertz.
Intrinsic t-Stochastic Neighbor Embedding for Visualization and Outlier Detection – A Remedy Against the Curse of Dimensionality?
In: Proceedings of the 10th International Conference on Similarity Search and Applications (SISAP), Munich, Germany. 2017

First, we have this input data:

Fish

As you may guess, this is derived from a "color me" image for kids.

If we run this through SNE (NOT t-SNE, but the predecessor):

SNE fish

Wow, our fish has become quite a sea monster! Because the kernel size is chosen locally, we lose much of the density information.

But you will be really surprised by the output of t-SNE:

t-SNE fish

I have actually tried two implementations (the ELKI, and the sklearn implementations), and both produced such a result. Some disconnected fragments, but that each look somewhat consistent with the original data.

Two important points to explain this:

  1. SGD relies on an iterative refinement procedure, and may get stuck in local optima. In particular, this makes it hard for the algorithm to "flip" a part of the data that it has mirrored, as this would require moving points through others that are supposed to be separate. So if some parts of the fish are mirrored, and other parts are not mirrored, it may be unable to fix this.

  2. t-SNE uses the t-distribution in the projected space. In contrast to the Gaussian distribution used by regular SNE, this means most points will repel each other, because they have 0 affinity in the input domain (Gaussian gets zero quickly), but >0 affinity in the output domain. Sometimes (as in MNIST) this makes nicer visualization. In particular, it can help "splitting" a data set a bit more than in the input domain. This additional repulsion also often causes points to more evenly use the area, which can also be desirable. But here in this example, the repelling effects actually cause fragments of the fish to separate.

We can help (on this toy data set) the first issue by using the original coordinates as initial placement, rather than random coordinates (as usually used with t-SNE). This time, the image is sklearn instead of ELKI, because the sklearn version already had a parameter to pass initial coordinates:

Fish, t-SNE, with original coordinates as initialization

As you can see, even with "perfect" initial placement, t-SNE will "break" the fish in a number of places that were originally connected because the Student-t repulsion in the output domain is stronger than the Gaussian affinity in the input space.

As you can see, t-SNE (and SNE, too!) are interesting visualization techniques, but they need to be handled carefully. I would rather not apply k-means on the result! because the result will be heavily distorted, and neither distances nor density are preserved well. Instead, rather use it for visualization.