What classification algorithm should one use after seeing that t-SNE separates classes well?

Tags: classification, t-sne

Let's assume we have a classification problem, and at first we want to get some insight from the data, so we run t-SNE. The t-SNE result separates the classes very well. This suggests that it is possible to build a classification model that will also separate the classes very well (if t-SNE does not separate them well, that does not imply much either way).

Knowing that t-SNE focuses on local structure and that it separates the classes well: which classification algorithms should work well on this problem? Scikit-learn suggests an SVM with a Gaussian RBF kernel, but what are the others?

Best Answer

First a brief answer, and then a longer comment:

Answer

SNE techniques compute an $N \times N$ similarity matrix both in the original data space and in the low-dimensional embedding space, in such a way that the similarities form a probability distribution over pairs of objects. Specifically, the probabilities are generally given by a normalized Gaussian kernel computed from the input data or from the embedding; in the original formulation, the probability that point $x_i$ picks $x_j$ as a neighbour is

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}.$$

In terms of classification, this immediately brings to mind instance-based learning methods. You have listed one of them: SVMs with RBF kernels, and @amoeba has listed kNN. There are also radial basis function networks, which I am not an expert on.
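To make this concrete, here is a minimal sketch that cross-validates two of these instance-based classifiers with scikit-learn. The bundled digits dataset stands in for your data, and the hyperparameter values are illustrative assumptions, not recommendations:

```python
# Compare two instance-based classifiers that share t-SNE's
# Gaussian-neighbourhood view of the data.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder dataset; substitute your own X, y.
X, y = load_digits(return_X_y=True)

models = {
    "SVM (RBF kernel)": make_pipeline(StandardScaler(),
                                      SVC(kernel="rbf", gamma="scale")),
    "kNN (k=5)": make_pipeline(StandardScaler(),
                               KNeighborsClassifier(n_neighbors=5)),
}

for name, model in models.items():
    # 5-fold cross-validated accuracy for each model
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Standardizing the features first matters for both methods, since each relies on Euclidean distances in the input space.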

Comment

Having said that, I would be doubly careful about making inferences on a dataset just by looking at t-SNE plots. t-SNE does not necessarily focus on local structure; you can, however, push it in that direction by tuning the perplexity parameter, which (loosely) regulates how to balance attention between local and global aspects of your data.

In this context, perplexity is a user-provided guess at how many close neighbours each observation has. The original paper states: “The performance of t-SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.” However, my experience is that getting the most from t-SNE often means analyzing multiple plots with different perplexities.

In other words, by tuning the learning rate and perplexity, it is possible to obtain very different-looking 2-D plots for the same number of training steps on the same data.
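As a quick illustration, the following sketch embeds the same data at several perplexities so you can see how different the plots can look. The dataset and perplexity values are arbitrary choices for demonstration:

```python
# Embed the same data at several perplexities and plot side by side.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

perplexities = [5, 30, 50, 100]
fig, axes = plt.subplots(1, len(perplexities), figsize=(16, 4))
for ax, perp in zip(axes, perplexities):
    emb = TSNE(n_components=2, perplexity=perp,
               random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
    ax.set_title(f"perplexity = {perp}")
plt.tight_layout()
plt.show()
```

Even with a fixed random seed, the apparent tightness and separation of the clusters typically changes noticeably from panel to panel.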

The Distill article How to Use t-SNE Effectively gives a great summary of the common pitfalls of t-SNE analysis. Its main points are:

  1. Those hyperparameters (e.g. learning rate, perplexity) really matter.

  2. Cluster sizes in a t-SNE plot mean nothing.

  3. Distances between clusters might not mean anything.

  4. Random noise doesn't always look random.

  5. You can see some shapes, sometimes.

  6. For topology, you may need more than one plot.

Given points 2, 3, and 6 above in particular, I would think twice about making inferences about the separability of the data from individual t-SNE plots. There are many cases in which you can 'manufacture' plots that show clear clusters by using the right parameters.
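As a toy demonstration of point 4, the sketch below runs t-SNE on pure Gaussian noise; with a small perplexity, the resulting plot can show apparent clusters even though there is no structure at all. The dimensions and perplexity value are arbitrary assumptions:

```python
# t-SNE on isotropic Gaussian noise: apparent "clusters" from nothing.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(500, 50))  # 500 points of 50-d pure noise

emb = TSNE(n_components=2, perplexity=5,
           random_state=0).fit_transform(X_noise)
plt.scatter(emb[:, 0], emb[:, 1], s=5)
plt.title("t-SNE of pure noise (perplexity = 5)")
plt.show()
```

If a plot like this can be produced from noise, class separation seen in a single t-SNE plot should be treated as a hint, not as evidence.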
