Solved – What are the advantages of Wasserstein distance compared to Jensen-Shannon divergence

Tags: distance, distributions, divergence, metric, wasserstein

What is the practical difference between the Wasserstein metric and the Jensen-Shannon divergence? The Wasserstein metric is also referred to as the Earth mover's distance.

From Wikipedia:

The Wasserstein metric is a distance function defined between probability distributions on a given metric space M. Intuitively, if each distribution is viewed as a unit amount of earth (soil) piled on M, the metric is the minimum "cost" of turning one pile into the other, which is assumed to be the amount of earth that needs to be moved times the mean distance it has to be moved.
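For reference, the standard definition behind that intuition: the $p$-Wasserstein distance between distributions $P$ and $Q$ on a metric space $(M, d)$ is

$$
W_p(P, Q) = \left( \inf_{\gamma \in \Gamma(P, Q)} \int_{M \times M} d(x, y)^p \,\mathrm{d}\gamma(x, y) \right)^{1/p},
$$

where $\Gamma(P, Q)$ is the set of all couplings of $P$ and $Q$ (joint distributions whose marginals are $P$ and $Q$). The earth-moving picture corresponds to $p = 1$: each coupling $\gamma$ is a transport plan specifying how much mass is moved from $x$ to $y$.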

and

The Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions. It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and always has a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as the Jensen–Shannon distance.
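Written out explicitly (the standard definition; note that here the second argument is the mixture of the two distributions, not the metric space $M$ above):

$$
\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2} D_{\mathrm{KL}}\!\left(P \,\Big\|\, \tfrac{P+Q}{2}\right) + \tfrac{1}{2} D_{\mathrm{KL}}\!\left(Q \,\Big\|\, \tfrac{P+Q}{2}\right).
$$

Since the mixture $(P+Q)/2$ dominates both $P$ and $Q$, each KL term is at most $\log 2$; this is why the JSD is always finite, bounded by $\log 2$ even when the supports of $P$ and $Q$ are disjoint.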

I've seen the JSD used in machine learning, but not the Wasserstein metric so much, even though it has been used to improve GAN training (the Wasserstein GAN). Is there a good guideline on when to use one or the other?

Best Answer

Following the examples of Arjovsky et al. (2017) and Kolouri et al. (2018), Kolouri et al. (2019) show a simple example in their supplementary material comparing the Jensen–Shannon divergence with the Wasserstein distance.

[Figure: comparison of the Jensen–Shannon divergence and the Wasserstein distance between two distributions with non-overlapping supports]

As can be seen, the JS divergence fails to provide a useful gradient when the distributions are supported on non-overlapping domains: it saturates at its maximum value of $\log 2$ no matter how far apart the supports are, whereas the Wasserstein distance continues to reflect the actual separation between them.
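For a quick numerical sketch of the same effect (my own illustration using SciPy's built-in `scipy.stats.wasserstein_distance` and `scipy.spatial.distance.jensenshannon`, not the figure from the paper), take two point masses on a grid whose supports never overlap and shift one of them:

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.spatial.distance import jensenshannon

# Two point masses on a 1-D grid, separated by a shift theta.
# While the supports do not overlap, the JS divergence is stuck
# at log 2, whereas the Wasserstein distance grows with theta.
grid = np.arange(11, dtype=float)  # support points 0, 1, ..., 10

for theta in range(1, 6):
    p = np.zeros(len(grid)); p[0] = 1.0       # all mass at x = 0
    q = np.zeros(len(grid)); q[theta] = 1.0   # all mass at x = theta

    w1 = wasserstein_distance(grid, grid, u_weights=p, v_weights=q)
    # jensenshannon returns the JS *distance* (square root of the
    # divergence), so square it to recover the divergence itself.
    jsd = jensenshannon(p, q) ** 2

    print(f"theta={theta}: W1={w1:.3f}, JSD={jsd:.3f} (log 2 = {np.log(2):.3f})")
```

The JSD is pinned at $\log 2 \approx 0.693$ for every $\theta$, so moving the distributions closer or further apart produces no change in the objective (zero gradient), while the Wasserstein distance equals $\theta$ exactly and varies smoothly with the shift. This is precisely the property the Wasserstein GAN exploits, since early in GAN training the generator and data distributions typically have little or no overlap.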