Solved – Why use perplexity rather than nearest neighbor match in t-SNE

dimensionality reduction, information theory, machine learning, tsne

I am trying to work through Van der Maaten and Hinton's paper on t-SNE (which was inspired by Hinton and Roweis' SNE) and am having trouble understanding why they use a perplexity parameter.

In the t-SNE paper the authors suggest "The perplexity can be interpreted as a smooth measure of the effective number of neighbors". It is clear to me why they need to set $\sigma_i$ to a different value for each $i$, but why complicate matters with perplexity? It seems easier to understand, and just as quick, to do a binary search for a $\sigma_i$ that results in $k$ (user-specified) neighbors falling within two standard deviations of point $i$.
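For reference, in the t-SNE paper the binary search chooses $\sigma_i$ so that the perplexity of the conditional distribution $P_i$ equals the user-specified value, where

$$p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i}\exp\!\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)}, \qquad H(P_i) = -\sum_j p_{j|i}\log_2 p_{j|i}, \qquad \mathrm{Perp}(P_i) = 2^{H(P_i)}.$$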

I tested this idea on a simple case: 100 samples from a 3-dimensional Gaussian mixture, using a perplexity of 20 in one case and $k=6$ nearest neighbors in the other. There is a near-linear relationship between the $\sigma_i$ values generated by the two methods, as shown below:

[scatter plot: $\sigma_i$ from the perplexity search vs. $\sigma_i$ from the $k$-NN search, showing a near-linear relationship]
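As a point of comparison, here is a minimal sketch of how such a side-by-side calibration could be set up; the mixture parameters and the helper names (`sigma_from_perplexity`, `sigma_from_knn`) are illustrative assumptions, not the code that produced the plot above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def sigma_from_perplexity(d2_row, target_perp, iters=50):
    """Binary-search sigma_i so that Perp(P_i) = 2^H(P_i) matches target_perp."""
    lo, hi = 1e-10, 1e4
    for _ in range(iters):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-d2_row / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if 2.0 ** entropy > target_perp:   # too many effective neighbors -> shrink sigma
            hi = sigma
        else:
            lo = sigma
    return sigma

def sigma_from_knn(d_row, k):
    """The rule proposed above: sigma_i such that the k nearest neighbors
    fall within two standard deviations of point i."""
    return np.sort(d_row)[k - 1] / 2.0

# toy data: 100 samples from a 3-dimensional Gaussian mixture (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(n, 3))
               for m, n in zip((0.0, 4.0, 8.0), (34, 33, 33))])

D = squareform(pdist(X))      # pairwise distances
np.fill_diagonal(D, np.inf)   # exclude each point's distance to itself

sig_perp = [sigma_from_perplexity(np.delete(D[i], i) ** 2, 20) for i in range(len(X))]
sig_knn  = [sigma_from_knn(D[i], 6) for i in range(len(X))]
```

Plotting `sig_perp` against `sig_knn` should show the kind of near-linear relationship described above.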

Perhaps my example just wasn't complex enough? Or maybe the use of perplexity, as $2^{H(P_i)}$ (the exponentiated Shannon entropy), conveys some additional information that I just don't have intuition for?

Any insight would be appreciated.

[edit]

In response to @geomatt's comment, I have run it again but dropped the dimension down to two so that it can be visualized. The plot below shows the location of the points in $(x_1, x_2)$ space, with the resulting sigma values in black ($k$-NN) and red (perplexity). Aside from a constant multiple, I still don't notice a huge difference.

[scatter plot: point locations in $(x_1, x_2)$ space, annotated with $\sigma_i$ from the $k$-NN search (black) and the perplexity search (red)]

Best Answer

"The perplexity can be interpreted as a smooth measure of the effective number of neighbors" could be interpreted as $\frac{\delta \sigma_i}{\delta P}$ being smooth. That is, varying Perplexity has an effect on $\sigma_i$ for a fixed i that is continuous in all derivatives.

This is not true of the $k$-NN approach. One can imagine fixing an $i$ that lies within a cluster containing $G$ points. Varying $k$ from $2, \dots, G-1$ should result in similar, monotonically increasing values of $\sigma_i$. There is a jump at $k = G$, as $\sigma_i$ must then become large enough to reach outside the cluster; the distance between the cluster and the nearest outside point determines the size of this jump.

Below is a (not-too-extreme) example of this. I simulated 100 points from a Gaussian mixture with two components of equal probability; the data was not quite linearly separable. I chose a point $i$ at random and varied the perplexity and $k$. Note the sudden change in the first derivative of the $k$-NN curve as $k$ nears 50 (the expected number of points that fall within a given cluster).

[plots: $\sigma_i$ for the chosen point as a function of perplexity, and as a function of $k$]
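Along the lines of that simulation, here is a minimal sketch of how the effect could be reproduced; the cluster means, the binary-search settings, and the helper names (`sigma_for_perplexity`, `sigma_for_knn`) are my own choices for illustration, not the answerer's original code.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)

# two clusters of ~50 points each, close enough to be "not quite" separable
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])

i = rng.integers(len(X))                    # the fixed point i
d = np.delete(cdist(X[[i]], X).ravel(), i)  # distances from i to every other point
d2 = d ** 2

def sigma_for_perplexity(target_perp, iters=50):
    """Binary-search sigma_i so that 2^H(P_i) matches target_perp."""
    lo, hi = 1e-10, 1e4
    for _ in range(iters):
        s = 0.5 * (lo + hi)
        p = np.exp(-d2 / (2.0 * s ** 2))
        p /= p.sum()
        h = -np.sum(p * np.log2(p + 1e-12))
        lo, hi = (s, hi) if 2.0 ** h < target_perp else (lo, s)
    return s

def sigma_for_knn(k):
    """sigma_i such that exactly k neighbors fall within 2*sigma_i of point i."""
    return np.sort(d)[k - 1] / 2.0

ks = np.arange(2, 100)
sig_knn  = [sigma_for_knn(k) for k in ks]          # expect a kink near k = 50
sig_perp = [sigma_for_perplexity(p) for p in ks]   # expect a smooth curve in perplexity
```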