Solved – Similarity probabilities in SNE vs t-SNE

Tags: dimensionality-reduction, machine-learning, tsne

In the original t-SNE paper (van der Maaten and Hinton 2008, Visualizing Data using t-SNE), the similarity probabilities for stochastic neighbor embedding (SNE) are defined in Section 2 as

$$p_{j|i} = \frac{\exp(-||x_{i} - x_{j}||^{2}/2\sigma_{i}^{2})}{\sum_{k \neq i}{\exp(-||x_{i} - x_{k}||^{2}/2\sigma_{i}^{2})}}$$

and the probabilities for t-distributed stochastic neighbor embedding (t-SNE) are defined in Section 3 as

$$p_{ij} = \frac{\exp(-||x_{i} - x_{j}||^{2}/2\sigma^{2})}{\sum_{k \neq l}{\exp(-||x_{k} - x_{l}||^{2}/2\sigma^{2})}}.$$
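
For concreteness, here is how I read the first, conditional definition in code (a minimal numpy sketch; the function and variable names are mine, not the paper's). Each row $i$ uses its own bandwidth $\sigma_{i}$ and is normalized to sum to 1:

```python
import numpy as np

def conditional_p(X, sigmas):
    """Conditional similarities p_{j|i}; sigmas[i] is the bandwidth for row i."""
    # Squared pairwise distances ||x_i - x_j||^2, shape (n, n)
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-D / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)                   # the sum excludes k = i; p_{i|i} := 0
    return P / P.sum(axis=1, keepdims=True)    # each row sums to 1
```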

According to my prior understanding, the original SNE and t-SNE differ only in the formula for $q_{ij}$: SNE uses a Gaussian for $q_{ij}$, while t-SNE uses a Student's t-distribution. But the two formulas above are different as well; why?

My questions are about the second formula: where do the iterators $k$ and $l$ come from? And is that $\sigma$ the same as $\sigma_{i}$ or not?
The iterators $k$ and $l$ in the definition of $q_{ij}$ also confuse me.

Best Answer

I think the paper defines the joint distribution (not the conditional distribution!) as

$$p_{ij} = \frac{\exp(-||x_{i} - x_{j}||^{2}/2\sigma^{2})}{\sum_{k \neq l}{\exp(-||x_{k} - x_{l}||^{2}/2\sigma^{2})}},$$

but they do not use it and instead define $$p_{ij}=\frac{p_{j|i}+p_{i|j}}{2n},$$ where $n$ is the number of data points, so that $\sum_{i,j} p_{ij} = 1$.
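
In code, this symmetrization is just the average of the conditional matrix with its transpose, divided by $n$; a minimal sketch, assuming a precomputed conditional matrix like the one in the question:

```python
import numpy as np

def symmetrized_joint(P_cond):
    """p_ij = (p_{j|i} + p_{i|j}) / (2n); sums to 1 because each row of P_cond does."""
    n = P_cond.shape[0]
    P = (P_cond + P_cond.T) / (2.0 * n)
    assert np.isclose(P.sum(), 1.0)   # a valid joint distribution
    return P
```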

As mentioned in the paper, the original SNE and t-SNE differ in two respects:

The cost function used by t-SNE differs from the one used by SNE in two ways: (1) it uses a symmetrized version of the SNE cost function with simpler gradients that was briefly introduced by Cook et al. (2007) and (2) it uses a Student-t distribution rather than a Gaussian to compute the similarity between two points in the low-dimensional space. t-SNE employs a heavy-tailed distribution in the low-dimensional space to alleviate both the crowding problem and the optimization problems of SNE.

Update based on the question edit: The denominator in both cases is just a normalization, ensuring that $\sum_{j} p_{j|i} = 1$ for each $i$ and that $\sum_{i,j} p_{ij} = 1$, the basic requirement for both to be distributions.
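
To make the $k \neq l$ denominator concrete, here is a sketch of the single-$\sigma$ joint definition (again, the names are mine, not the paper's): the denominator is just the Gaussian kernel summed over every ordered pair of distinct points, so the whole matrix sums to 1.

```python
import numpy as np

def joint_p_single_sigma(X, sigma):
    """Joint similarities p_ij with one shared sigma; the denominator is the
    kernel summed over all ordered pairs k != l."""
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-D / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)   # exclude k = l
    return W / W.sum()         # the full matrix sums to 1
```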

Also, since there is only one Gaussian here, we take $\sigma$ as its standard deviation. In the first case there was one Gaussian per point $i$, and we could have used a common standard deviation, but instead $\sigma_{i}$ is made to depend on the density of neighbors around the point: if a point has a large number of neighbors within a given distance, its conditional distribution should drop off faster than the conditional distribution of a point in a sparser region.
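
The paper implements this density dependence by binary-searching each $\sigma_{i}$ so that the perplexity $2^{H(p_{\cdot|i})}$ of row $i$ matches a user-chosen value. A minimal sketch of that idea (the helper name and default parameters are my own):

```python
import numpy as np

def sigma_for_perplexity(sq_dists_i, target, tol=1e-5, max_iter=50):
    """Binary-search sigma_i so that the perplexity 2**H of the conditional
    distribution p_{.|i} matches the target. sq_dists_i holds the squared
    distances from x_i to every other point (self excluded)."""
    lo, hi = 1e-10, 1e4
    sigma = 1.0
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
        p /= p.sum()
        perplexity = 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))  # 2^H(p)
        if abs(perplexity - target) < tol:
            break
        if perplexity > target:
            hi = sigma   # distribution too flat: shrink sigma
        else:
            lo = sigma   # distribution too peaked: grow sigma
    return sigma
```

Dense neighborhoods then end up with a small $\sigma_{i}$ and sparse ones with a large $\sigma_{i}$, which is exactly the behavior described above.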