Solved – Why Kullback-Leibler in Stochastic Neighbor Embedding

kullback-leibler, tsne

Stochastic Neighbor Embedding (and t-SNE) relies on the Kullback-Leibler divergence between the point distributions in the original and the low-dimensional space. Why? Why not any other dissimilarity measure (Wasserstein, Jensen-Shannon, Kolmogorov-Smirnov, …)? The authors, Hinton and Roweis, simply state:

The aim of the embedding is to match these two distributions as well as possible. This is achieved by minimizing a cost function which is a sum of Kullback-Leibler divergences between the original and induced distributions over neighbors for each object

without giving a justification.

Best Answer

Dimensionality reduction techniques are often motivated by finding new representations of the data that reveal hidden variables or structure. SNE takes a different approach (compared to PCA, for example): it aims to preserve local structure, and it does so by exploiting the asymmetry of the KL divergence.

Conditional probabilities as inverse distance

Looking at Eq (1), notice that the conditional probability can be interpreted as an "inverse distance": close points (low distance) are assigned high probabilities, and far points (high distance) are assigned low probabilities.
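For reference, the conditional probability in Eq (1) is a normalised Gaussian similarity; in the notation used below, $p_{i|j}$ is the probability that point $j$ would pick point $i$ as its neighbour,

$$ p_{i|j} = \frac{\exp\left(-\lVert x_j - x_i \rVert^2 / 2\sigma_j^2\right)}{\sum_{k \neq j} \exp\left(-\lVert x_j - x_k \rVert^2 / 2\sigma_j^2\right)} ,$$

and $q_{i|j}$ is defined analogously from the low-dimensional coordinates (with a fixed variance).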

(Note: "inverse distance" is of course not meant in a strict mathematical sense, because the distances, which live in $ [0, \infty) $, are mapped onto the smaller set $ [0,1] $.)

Taking advantage of asymmetry in KL

Looking at the cost function in Eq (2), two scenarios are treated differently than they would be under a symmetric cost; a small numeric illustration follows the list below.
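For reference, the cost in Eq (2) is the sum, over all points, of the KL divergence of the induced neighbour distribution from the original one (written here in the same notation):

$$ C = \sum_j KL(P_j \| Q_j) = \sum_j \sum_i p_{i|j} \log \frac{p_{i|j}}{q_{i|j}} .$$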

  1. $ p_{i|j} \gg q_{i|j} $: points that are close in the high-dimensional space but far apart in the low-dimensional space are penalised heavily. This is important, because it promotes the preservation of local structure.
  2. $ q_{i|j} \gg p_{i|j} $: points that are far apart in the high-dimensional space but close in the low-dimensional space are penalised much less heavily. This is acceptable for our purposes.
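A minimal numeric sketch of this asymmetry (the probability values are made up for illustration): each pair contributes $ p \log(p/q) $ to the KL cost, so the first scenario is penalised far more heavily than the second.

```python
import numpy as np

def kl_contribution(p, q):
    """Per-pair contribution p * log(p / q) to the KL cost."""
    return p * np.log(p / q)

# Scenario 1: close in high-D (large p), far in low-D (small q) -> large penalty
print(kl_contribution(0.8, 0.01))  # ~ 3.5

# Scenario 2: far in high-D (small p), close in low-D (large q) -> tiny contribution
print(kl_contribution(0.01, 0.8))  # ~ -0.04
```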

Thus, the asymmetry of the KL divergence, together with the definition of the conditional probability, constitutes the key idea of this dimensionality reduction technique. Below, you can see that this is exactly why the other distances fail to be good substitutes.

So then, what is the problem with the other distance metrics?

The Jensen-Shannon divergence is effectively a symmetrisation of the KL divergence:

$$ JSD(P_i \| Q_i) = \frac{1}{2} KL(P_i \| M_i) + \frac{1}{2} KL(Q_i \| M_i), \qquad M_i = \frac{1}{2}(P_i + Q_i) .$$

Being symmetric, it loses exactly the property that preserves local structure, so it is not a good substitute.
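As a quick sanity check (with two arbitrary, made-up discrete distributions), the KL divergence changes when its arguments are swapped while the Jensen-Shannon divergence does not; note that SciPy's `jensenshannon` returns the square root of the divergence, which does not affect the symmetry argument.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.special import rel_entr

# Two arbitrary discrete distributions (hypothetical values)
P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.1, 0.3, 0.6])

# KL is asymmetric: KL(P||Q) != KL(Q||P)
print(rel_entr(P, Q).sum(), rel_entr(Q, P).sum())

# Jensen-Shannon is symmetric: swapping the arguments gives the same value
print(jensenshannon(P, Q), jensenshannon(Q, P))
```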

The Wasserstein distance can intuitively be seen as the cost of rearranging one histogram into another. The rearrangement costs the same in both directions, so the Wasserstein metric is also symmetric and does not have this desirable property.

The Kolmogorov-Smirnov distance is nonparametric, i.e. it does not assume a particular probability distribution, whereas SNE explicitly builds on the distributional structure defined in Eq (1).
