About the inputs of the Wasserstein Distance $W_1$

machine learning, probability, probability distributions, probability theory, python

Introduction (this only provides background; you can skip it and go directly to the questions).

  1. Let's consider the following Proposition from "Ramdas & Trillos (2015), On Wasserstein Two Sample Testing and Related Families of Nonparametric Tests" (p. 10):

Proposition 1. The $p$-Wasserstein distance between two probability measures $P$ and $Q$
on $\mathbb{R}$ with finite $p$-th moments can be written as
\begin{equation}
W_{p}^{p}(P,Q)=\int_{0}^{1}\left | F^{-1}(t)-G^{-1}(t) \right |^{p}dt
\end{equation}

where $F^{-1}$ and $G^{-1}$ are the quantile functions of $P$ and $Q$ respectively.

  2. Then, applying Fubini's theorem, for $p=1$ we get (see Proposition 2.17, p. 66, of "Santambrogio (2015), Optimal Transport for Applied Mathematicians"):
    \begin{align}
    W_{1}(P,Q) &= \int_{0}^{1}\left | F^{-1}(t)-G^{-1}(t) \right |dt \\
    &= \int_{\mathbb{R}}\left | F(x)-G(x) \right |dx
    \end{align}

    where $F$ and $G$ are the cumulative distribution functions (CDFs) of $P$ and $Q$, respectively.
    [Note: I think this last equivalence holds only for $p=1$, but I am not sure.]

  3. Given the background above, I would like to calculate, with Python, the Wasserstein distance $W_1$ as indicated in scipy.stats.wasserstein_distance, i.e.
    $\int_{\mathbb{R}}\left | F(x)-G(x) \right |dx$. [Please bear in mind that I did not use the same notation as in scipy.stats.wasserstein_distance, where $W_1$ reads as $\int_{-\infty}^{+\infty}\left | U-V \right |dx$.] A numerical check of the formulas above against SciPy is sketched right after this list.

  4. Still in scipy.stats.wasserstein_distance, we read about the inputs:

The input distributions can be empirical, therefore coming from samples whose values are effectively inputs of the function, or they can be seen as generalized functions, in which case they are weighted sums of Dirac delta functions located at the specified values.
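
Here is the numerical sketch referred to above (a minimal illustration, assuming NumPy and SciPy are installed; the samples, the seed, and the helper `emp_quantile` are my own illustrative choices, not part of SciPy's API). It evaluates $W_1$ for two random samples in three ways: with scipy.stats.wasserstein_distance, via the CDF formula, and via the quantile formula.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
u = rng.normal(0.0, 1.0, size=200)   # sample from P
v = rng.normal(0.5, 1.5, size=300)   # sample from Q

# (1) SciPy's result.
w_scipy = wasserstein_distance(u, v)

# (2) CDF form: W_1 = integral of |F(x) - G(x)| dx, with F and G the eCDFs.
# Both eCDFs are step functions, so the integral reduces to a finite sum
# over the intervals between consecutive points of the pooled sample.
xs = np.sort(np.concatenate([u, v]))
F = np.searchsorted(np.sort(u), xs[:-1], side="right") / len(u)
G = np.searchsorted(np.sort(v), xs[:-1], side="right") / len(v)
w_cdf = np.sum(np.abs(F - G) * np.diff(xs))

# (3) Quantile form: W_1 = integral over (0,1) of |F^{-1}(t) - G^{-1}(t)| dt,
# using the empirical quantile function F^{-1}(t) = inf{x : F(x) >= t}
# (a helper written here for illustration).
def emp_quantile(sample, t):
    s = np.sort(sample)
    idx = np.ceil(t * len(s)).astype(int) - 1   # 0-based index of the ceil(n*t)-th order statistic
    return s[np.clip(idx, 0, len(s) - 1)]

# Midpoint Riemann sum; 30,000 is a common multiple of both sample sizes,
# so the step functions are constant on every grid cell and the sum is exact.
t = (np.arange(30_000) + 0.5) / 30_000
w_quantile = np.mean(np.abs(emp_quantile(u, t) - emp_quantile(v, t)))

print(w_scipy, w_cdf, w_quantile)   # the three values should agree up to floating-point rounding
```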

Questions.

In math, you calculate the Wasserstein distance $W_1$ between two probability measures $P$ and $Q$ by using the CDFs (or the inverse CDFs) of those two probability measures, i.e. $F$ and $G$ (or $F^{-1}$ and $G^{-1}$).

In Python (please see scipy.stats.wasserstein_distance), you use the "Values observed in the (empirical) distributions" as inputs to calculate the Wasserstein Distance $W_1$. Therefore:

  1. What are the "(empirical) distributions" mentioned in the SciPy documentation as inputs for calculating $W_1$? That is, do they refer to empirical estimates of the probability density functions, i.e. histograms, or to the empirical cumulative distribution functions (eCDFs)?
  2. How are the inputs used in Python related to the two probability measures $P$ and $Q$?

Best Answer

The documentation is indeed poorly written.

scipy.stats.wasserstein_distance has four arguments. u_values and v_values are 1D arrays, say u_values $= \{u_1,\ldots,u_n\}$ and v_values $= \{v_1,\ldots,v_m\}$.

The empirical distribution corresponding to u_values is $\sum_{i=1}^n \frac 1n \delta_{u_i}$ and that corresponding to v_values is $\sum_{j=1}^m \frac 1m \delta_{v_j}$ (recall that $\delta_{u_i}$ denotes a Dirac measure).

In each empirical distribution, the weights are uniform by default. You can specify other weights with the arguments u_weights and v_weights. If u_weights $= \{a_1,\ldots,a_n\}$, where the $a_i$ are positive, then let $p_i = \frac{a_i}{\sum_{k=1}^n a_k}$ for each $i$; the corresponding distribution is
$$P = \sum_{i=1}^n p_i \delta_{u_i}.$$
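
As a concrete illustration of this mapping (a small sketch, assuming SciPy is installed; the specific values and weights are made up for the example), the following shows that u_values together with u_weights encode the discrete measure $P = \sum_i p_i \delta_{u_i}$, and that giving a value weight $k$ is equivalent to repeating it $k$ times in an unweighted sample:

```python
from scipy.stats import wasserstein_distance

# P = 0.5*delta_0 + 0.25*delta_1 + 0.25*delta_3, encoded via weights
# (the weights need not sum to 1; SciPy normalizes them internally).
u_values = [0.0, 1.0, 3.0]
u_weights = [2.0, 1.0, 1.0]

# Q = 0.5*delta_1 + 0.5*delta_2, with the default uniform weights.
v_values = [1.0, 2.0]

d_weighted = wasserstein_distance(u_values, v_values, u_weights=u_weights)

# The same measure P can be written as an unweighted sample in which each
# value is repeated in proportion to its weight.
d_repeated = wasserstein_distance([0.0, 0.0, 1.0, 3.0], v_values)

print(d_weighted, d_repeated)   # both should print 1.0 for this example
```

In other words, the inputs are the raw observed values (plus optional weights) at which the point masses sit; SciPy builds the eCDFs from these values internally.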
