When considering the advantages of the Wasserstein metric compared to KL divergence, the most obvious one is that W is a metric whereas KL divergence is not, since KL is not symmetric (i.e. $D_{KL}(P||Q) \neq D_{KL}(Q||P)$ in general) and does not satisfy the triangle inequality (i.e. $D_{KL}(R||P) \leq D_{KL}(Q||P) + D_{KL}(R||Q)$ does not hold in general).
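The asymmetry is easy to check numerically with scipy; the toy distributions below are made up purely for illustration:
import scipy.stats
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
print(scipy.stats.entropy(p, q))  # D_KL(p||q) ≈ 1.10 nats
print(scipy.stats.entropy(q, p))  # D_KL(q||p) ≈ 1.00 nats, so the two directions disagree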
As for the practical differences, one of the most important is that unlike KL (and many other measures) Wasserstein takes the metric space into account; what this means in less abstract terms is perhaps best explained by an example (feel free to skip to the figure, the code is just for producing it):
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

# define samples this way as scipy.stats.wasserstein_distance can't take probability distributions directly
sampP = [1,1,1,1,1,1,2,3,4,5]
sampQ = [1,2,3,4,5,5,5,5,5,5]
# and for scipy.stats.entropy (gives KL divergence here) we want distributions
P = np.unique(sampP, return_counts=True)[1] / len(sampP)
Q = np.unique(sampQ, return_counts=True)[1] / len(sampQ)
# compare to this sample / distribution:
sampQ2 = [1,2,2,2,2,2,2,3,4,5]
Q2 = np.unique(sampQ2, return_counts=True)[1] / len(sampQ2)
fig = plt.figure(figsize=(10,7))
fig.subplots_adjust(wspace=0.5)
plt.subplot(2,2,1)
plt.bar(np.arange(len(P)), P, color='r')
plt.xticks(np.arange(len(P)), np.arange(1, 6), fontsize=0)  # hide tick labels on the top row
plt.subplot(2,2,3)
plt.bar(np.arange(len(Q)), Q, color='b')
plt.xticks(np.arange(len(Q)), np.arange(1, 6))
plt.title("Wasserstein distance {:.4}\nKL divergence {:.4}".format(
scipy.stats.wasserstein_distance(sampP, sampQ), scipy.stats.entropy(P, Q)), fontsize=10)
plt.subplot(2,2,2)
plt.bar(np.arange(len(P)), P, color='r')
plt.xticks(np.arange(len(P)), np.arange(1, 6), fontsize=0)  # hide tick labels on the top row
plt.subplot(2,2,4)
plt.bar(np.arange(len(Q2)), Q2, color='b')
plt.xticks(np.arange(len(Q2)), np.arange(1, 6))
plt.title("Wasserstein distance {:.4}\nKL divergence {:.4}".format(
scipy.stats.wasserstein_distance(sampP, sampQ2), scipy.stats.entropy(P, Q2)), fontsize=10)
plt.show()
Here the measure between the red and blue distributions is the same for KL divergence in both cases, whereas the Wasserstein distance measures the work required to transport the probability mass from the red state to the blue state using the x-axis as a "road". This measure grows the further away the probability mass has to be moved (hence the alias earth mover's distance). So which one you want to use depends on your application area and what you want to measure. As a note, instead of KL divergence there are also other options, like the Jensen-Shannon distance, that are proper metrics; a quick check is sketched below.
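For example, scipy ships the Jensen-Shannon distance (the square root of the JS divergence), which is symmetric; reusing the P and Q arrays from the snippet above:
from scipy.spatial.distance import jensenshannon
print(jensenshannon(P, Q))  # equals jensenshannon(Q, P), unlike the KL divergence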
The Kullback-Leibler divergence is unbounded. Indeed, since there is no lower bound on the $q(i)$'s, there is no upper bound on the $p(i)/q(i)$'s. For instance, the Kullback-Leibler divergence between a Normal $N(\mu_1,\sigma_1^2)$ and a Normal $N(\mu_2,\sigma_1^2)$ is
$$\frac{1}{2\sigma_1^{2}}(\mu_1-\mu_2)^2$$ which is clearly unbounded.
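If you want to see this numerically, a rough sanity check is to discretize the two normal densities on a common grid and feed them to scipy.stats.entropy; the grid and its range below are arbitrary choices, and the discretization is only an approximation:
import numpy as np
from scipy.stats import norm, entropy
x = np.linspace(-10, 15, 20001)  # fine grid covering essentially all the mass
for mu2 in [1, 2, 5]:
    p = norm.pdf(x, loc=0, scale=1)
    q = norm.pdf(x, loc=mu2, scale=1)
    # entropy() renormalizes both arrays and the bin width cancels in the ratio,
    # so this approximates the continuous KL divergence (mu2**2 / 2 in closed form)
    print(mu2, entropy(p, q), mu2**2 / 2)
The divergence keeps growing without bound as the means move apart.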
Wikipedia [which has been known to be wrong!] indeed states
"...a Kullback–Leibler divergence of 1 indicates that the two
distributions behave in such a different manner that the expectation
given the first distribution approaches zero."
which makes no sense (expectation of which function? why 1 and not 2?)
A more satisfactory explanation from the same
Wikipedia page is that the Kullback–Leibler divergence
"...can be construed as measuring the expected number of extra bits
required to code samples from P using a code optimized for Q rather
than the code optimized for P."
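That coding interpretation is easy to verify for small discrete distributions (the distributions below are just an illustrative example): the KL divergence in base 2 equals the cross-entropy of P under a code optimized for Q minus the entropy of P.
import numpy as np
P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.25, 0.25, 0.5])
H_P = -np.sum(P * np.log2(P))      # optimal average code length for P (1.5 bits)
H_PQ = -np.sum(P * np.log2(Q))     # average length using a code built for Q (1.75 bits)
print(H_PQ - H_P)                  # 0.25 extra bits per symbol
print(np.sum(P * np.log2(P / Q)))  # D_KL(P||Q) in bits, the same number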
KL divergence is only defined for distributions that are defined on the same domain.
In t-SNE, KL divergence is not computed between data distributions in the high- and low-dimensional spaces (this would be undefined, as above). Rather, the distributions of interest are based on neighbor probabilities. The probability that two data points are neighbors is a function of their proximity, which is measured in either the high- or low-dimensional space. This yields two neighbor distributions (one for each space). The neighbor distributions are not defined on the high/low-dimensional spaces themselves, but on pairs of points in the dataset. Because these distributions are defined on the same domain, it's possible to compute the KL divergence between them. t-SNE seeks an arrangement of points in the low-dimensional space that minimizes the KL divergence.
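A heavily simplified sketch of that objective (random data purely for illustration; real t-SNE calibrates a per-point bandwidth from a perplexity target, symmetrizes the conditional probabilities, and optimizes the embedding by gradient descent):
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import entropy
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))              # points in the high-dimensional space
Y = rng.normal(size=(50, 2))               # a candidate low-dimensional embedding
P = np.exp(-pdist(X, 'sqeuclidean'))       # Gaussian affinities between pairs of points
P /= P.sum()                               # a probability distribution over pairs
Q = 1.0 / (1.0 + pdist(Y, 'sqeuclidean'))  # Student-t affinities for the same pairs
Q /= Q.sum()
print(entropy(P, Q))                       # the KL divergence t-SNE would minimize by moving Y
Both P and Q are defined on the same domain (the pairs of points in the dataset), which is exactly why the KL divergence is well defined here.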