Difference between Hellinger Distance and Wasserstein Distance between Two Distributions

measure-theory, operations research, optimal-transport, probability distributions, probability theory

I would like to understand the difference between the Hellinger distance and the Wasserstein distance. I come from a physics background, so I am hoping for an intuitive explanation. Does the Wasserstein distance carry more information than the Hellinger distance? In what way do the two differ? Is there any relation between the Hellinger distance and the Wasserstein distance between two distributions?

Best Answer

Since you asked for an intuitive explanation this is going to be somewhat imprecise, but hopefully helpful.

The Hellinger distance is a bounded metric: roughly, it accumulates the pointwise difference between the densities of two probability measures over all points of the underlying space, without regard to how far apart those points are.

So let's say we have two probability measures with densities $f$ and $g$, where $f$ has support $[0,1]$ and $g$ is $f$ shifted by $x\in\mathbb{R}$, so that it has support $[x, x+1]$.

For $x=0$ the two distributions coincide, so the Hellinger distance and every Wasserstein distance between them are zero. As we increase $x$ from $0$, the Hellinger distance increases until $x \ge 1$, where the supports no longer overlap; from then on it just stays at $1$. The Wasserstein distance, on the other hand, keeps increasing as $x$ increases.
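To make this concrete, here is a worked version of the example under an extra assumption that is not part of the description above: both densities are uniform, $f = \mathbf{1}_{[0,1]}$ and $g = \mathbf{1}_{[x,x+1]}$, and the Hellinger distance is normalized so that it is bounded by $1$. In that case

$$
H^2(f,g) = 1 - \int \sqrt{f(t)\,g(t)}\,dt = 1 - \max(0,\,1-|x|) = \min(|x|,\,1),
\qquad
W_p(f,g) = |x| \quad \text{for every } p \ge 1.
$$

So $H = \sqrt{\min(|x|,1)}$ saturates at $1$ once the supports stop overlapping, while $W_p$ grows linearly with the shift.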

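Intuitively, the Wasserstein distance is the cost of moving the mass of one distribution onto the other: the amount of mass moved times the distance it is displaced, for the cheapest possible way of moving it.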
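The contrast between the two distances can also be checked numerically. The following is a minimal sketch, assuming uniform densities on $[0,1]$ and $[x,x+1]$ (an illustrative choice), a simple grid discretization, and the one-dimensional identity $W_1 = \int |F - G|$ for the two CDFs $F$ and $G$. The function names are hypothetical.

```python
import numpy as np

# Grid wide enough to contain every shifted support used below.
dx = 1e-3
grid = np.arange(-1.0, 7.0, dx)

def uniform_density(grid, left, dx):
    """Discretized density of the uniform distribution on [left, left + 1]."""
    indicator = ((grid >= left) & (grid < left + 1.0)).astype(float)
    return indicator / (indicator.sum() * dx)  # normalize to integrate to 1

def hellinger(f, g, dx):
    """Hellinger distance, normalized so that it lies in [0, 1]."""
    bc = np.sum(np.sqrt(f * g)) * dx  # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))

def wasserstein1(f, g, dx):
    """1-Wasserstein distance on the line: integral of |F - G| over the CDFs."""
    F = np.cumsum(f) * dx
    G = np.cumsum(g) * dx
    return np.sum(np.abs(F - G)) * dx

f = uniform_density(grid, 0.0, dx)
for x in [0.0, 0.25, 0.5, 1.0, 2.0, 4.0]:
    g = uniform_density(grid, x, dx)
    print(f"x={x:4.2f}  Hellinger={hellinger(f, g, dx):.3f}  "
          f"W1={wasserstein1(f, g, dx):.3f}")
```

Up to discretization error, the Hellinger column should flatten out near $1$ once $x \ge 1$, while the $W_1$ column should keep tracking $x$.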

If this explanation is too handwavy, I can be more specific!

Related Solutions

[Math] Closed-form analytical solutions to Optimal Transport/Wasserstein distance

Although a bit old, this is indeed a good question. Here is my bit on the matter:

Regarding Gaussian Mixture Models: A Wasserstein-type distance in the space of Gaussian Mixture Models, Julie Delon and Agnes Desolneux, https://arxiv.org/pdf/1907.05254.pdf
Using the 2-Wasserstein metric, Mallasto and Feragen geometrize the space of Gaussian processes with $L_2$ mean and covariance functions over compact index spaces: Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes, Anton Mallasto, Aasa Feragen https://papers.nips.cc/paper/7149-learning-from-uncertain-curves-the-2-wasserstein-metric-for-gaussian-processes.pdf
The Wasserstein space of elliptical distributions is characterized by Muzellec and Cuturi. The authors show that, for elliptical probability distributions, the Wasserstein distance can be computed via a simple Riemannian descent procedure: Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions, Boris Muzellec and Marco Cuturi https://arxiv.org/pdf/1805.07594.pdf (Not closed form)
Tree metrics as ground metrics yield negative definite OT metrics that can be computed in a closed form. Sliced-Wasserstein distance is then a particular (special) case (the tree is a chain): Tree-Sliced Variants of Wasserstein Distances, Tam Le, Makoto Yamada, Kenji Fukumizu, Marco Cuturi https://arxiv.org/pdf/1902.00342.pdf
Sinkhorn distances/divergences (Cuturi, 2013) are now treated as distances in their own right (i.e. not merely as approximations to $\mathcal{W}_2^2$) (Genevay et al, 2019). Recently, this entropy-regularized optimal transport distance was found to admit a closed form for Gaussian measures: Janati et al (2020). This fascinating finding also extends to the unbalanced case.
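As a rough illustration of the entropy-regularized problem behind Sinkhorn distances, here is a minimal NumPy sketch of the Sinkhorn scaling iterations between two small point clouds. It is not taken from any of the papers above: the data, the regularization strength eps, the iteration count, the cost rescaling, and the function name are all illustrative assumptions, and no log-domain stabilization is attempted.

```python
import numpy as np

def sinkhorn_plan(a, b, cost, eps, n_iter=500):
    """Entropy-regularized transport plan via alternating Sinkhorn scalings."""
    K = np.exp(-cost / eps)            # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)              # rescale to match column marginals
        u = a / (K @ v)                # rescale to match row marginals
    return u[:, None] * K * v[None, :]

# Two small point clouds with uniform weights (illustrative data).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
y = rng.normal(size=(6, 2)) + 1.0
cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)  # squared Euclidean
cost = cost / cost.max()               # rescale so eps is on a comparable scale
a = np.full(5, 1.0 / 5)
b = np.full(6, 1.0 / 6)

plan = sinkhorn_plan(a, b, cost, eps=0.5)
print("transport cost of the regularized plan:", np.sum(plan * cost))
print("marginal errors:",
      np.abs(plan.sum(axis=1) - a).max(),
      np.abs(plan.sum(axis=0) - b).max())
```

A smaller eps brings the plan closer to the unregularized optimal transport plan, at the price of slower convergence and possible numerical underflow in the kernel.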

I would be happy to keep this list up to date and evolving.

[Math] 2-Wasserstein distance between empirical distributions

I solved the problem using the MOSEK solver and got similar results.

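import numpy as np
import cvxpy as cp

K = 600  # number of samples in each empirical distribution
D = 25   # dimension
# Columns are samples: train ~ N(0, I), test ~ N(0, 4I)
train = np.random.multivariate_normal(np.zeros(D), np.diag(np.ones(D) * 1), size=K).T
test = np.random.multivariate_normal(np.zeros(D), np.diag(np.ones(D) * 4), size=K).T
# Ground cost: Euclidean distance between every train/test pair
M = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        M[i, j] = np.linalg.norm(train[:, i] - test[:, j], 2)
# Transport plan with nonnegative entries and uniform marginals 1/K
Tra = cp.Variable((K, K), nonneg=True)
constraint = [Tra @ np.ones((K, 1)) == np.ones((K, 1)) / K,
              Tra.T @ np.ones((K, 1)) == np.ones((K, 1)) / K]
optprob = cp.Problem(cp.Minimize(cp.trace(Tra.T @ M)), constraint)
optprob.solve(solver=cp.MOSEK)  # requires a working MOSEK installation
print(np.trace(Tra.value.T @ M))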

The result is around 8.3 in 25 dimensions, which is not surprising: for continuous random variables, the Wasserstein distance is a continuous optimization problem over all possible joint probability distributions, whereas for the empirical distributions we in fact optimize over the matrix $T$ (Tra in the code), which has a higher optimal objective value than the continuous case.
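For comparison, the population value for the two Gaussians sampled above, $N(0, I)$ and $N(0, 4I)$ in 25 dimensions, can be computed from the well-known closed form for Gaussians, $W_2^2 = \|m_1 - m_2\|^2 + \operatorname{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\big)$. The sketch below is only illustrative: the helper names are hypothetical, and it evaluates the 2-Wasserstein distance with squared Euclidean cost, which is not necessarily the same ground cost as in the script above.

```python
import numpy as np

def psd_sqrt(a):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def gaussian_w2(m1, s1, m2, s2):
    """2-Wasserstein distance between N(m1, s1) and N(m2, s2)."""
    r = psd_sqrt(s1)
    cross = psd_sqrt(r @ s2 @ r)
    w2_sq = np.sum((m1 - m2) ** 2) + np.trace(s1 + s2 - 2.0 * cross)
    return np.sqrt(max(w2_sq, 0.0))  # guard against tiny negative round-off

D = 25
w2 = gaussian_w2(np.zeros(D), np.eye(D), np.zeros(D), 4.0 * np.eye(D))
print(w2)
```

For these parameters the formula reduces to $\sqrt{25\,(2-1)^2} = 5$, which sits below the empirical value of about 8.3 reported above.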
