Solved – Can we apply KL divergence to probability distributions on different domains?

information-theory, kullback-leibler, machine-learning, tsne

When I was reading the original t-SNE paper, I had a question about whether we can apply KL divergence to discrete probability distributions on different domains.

In the paper, they use KL divergence to measure the dissimilarity between two discrete (conditional) distributions, one defined over the high-dimensional domain and one over the low-dimensional domain.

However, according to the Wikipedia entry on Kullback–Leibler divergence, KL divergence for discrete probability distributions is defined for distributions on the same probability space. This implies that both distributions must share the same sample space.
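For two discrete distributions $P$ and $Q$ on a shared sample space $\mathcal{X}$, the entry gives the definition

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)},$$

which only makes sense when $P$ and $Q$ assign probabilities to the same outcomes $x$.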

Can we apply KL divergence to probability distributions on different domains?

Best Answer

KL divergence is only defined for distributions on the same domain.

In t-SNE, KL divergence is not computed between data distributions in the high- and low-dimensional spaces (that would be undefined, as above). Rather, the distributions of interest are based on neighbor probabilities. The probability that two data points are neighbors is a function of their proximity, which is measured in either the high- or low-dimensional space. This yields two neighbor distributions, one for each space. These neighbor distributions are not defined on the high/low-dimensional spaces themselves, but on pairs of points in the dataset. Because they are defined on the same domain, it's possible to compute the KL divergence between them. t-SNE seeks an arrangement of points in the low-dimensional space that minimizes this KL divergence.
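As a rough illustration (a simplified sketch rather than the full t-SNE procedure: it uses a single bandwidth `sigma` instead of the per-point perplexity calibration, and the function names are made up for this example), the sketch below shows that both neighbor distributions are indexed by pairs of points, so the KL divergence between them is well defined:

```python
import numpy as np

def neighbor_probs_high(X, sigma=1.0):
    """Symmetrized Gaussian neighbor probabilities p_ij in the high-dimensional space.
    (Real t-SNE calibrates a per-point sigma via perplexity; a single sigma keeps
    this sketch short.)"""
    n = X.shape[0]
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)   # pairwise squared distances
    affinities = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(affinities, 0.0)                        # a point is not its own neighbor
    P = affinities / affinities.sum(axis=1, keepdims=True)   # conditional p_{j|i}
    return (P + P.T) / (2 * n)                               # symmetrize to joint p_ij

def neighbor_probs_low(Y):
    """Student-t neighbor probabilities q_ij in the low-dimensional embedding."""
    d2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    affinities = 1.0 / (1.0 + d2)
    np.fill_diagonal(affinities, 0.0)
    return affinities / affinities.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) over the shared domain: all ordered pairs of data points."""
    mask = P > 0
    return np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps)))

# Both P and Q are n-by-n matrices indexed by pairs of points,
# so the KL divergence between them is well defined, even though the
# data live in 10 dimensions and the embedding lives in 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))   # data in the high-dimensional space
Y = rng.normal(size=(50, 2))    # candidate low-dimensional embedding
print(kl_divergence(neighbor_probs_high(X), neighbor_probs_low(Y)))
```

In the actual algorithm, the embedding `Y` is then optimized (by gradient descent) to drive this divergence down.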