Solved – Maximum Mean Discrepancy (distribution distance)

Tags: distance, distributions, domain-adaptation, feature-engineering, machine-learning

I have two data sets (source and target) that follow different distributions. I am using MMD, a non-parametric distance between distributions, to measure the discrepancy between the marginal distributions of the source and target data.

- source data, Xs
- target data, Xt
- adaptation matrix, A
- projected data, Zs = A'Xs and Zt = A'Xt
- MMD: Distance(P(Xs), P(Xt)) = || mean(A'Xs) − mean(A'Xt) ||

That means: the distance between the distributions of the source and target data in the original space is measured as the distance between the means of the projected source and target data in the embedded space.

I have a question about the concept of MMD.

In the MMD formula, why does computing a distance in the latent (projected) space measure the distance between the distributions in the original space?

Thanks

Best Answer

It might help to give slightly more of an overview of MMD.$\DeclareMathOperator{\E}{\mathbb E}\newcommand{\R}{\mathbb R}\newcommand{\X}{\mathcal X}\newcommand{\h}{\mathcal H}\DeclareMathOperator{\MMD}{MMD}$

In general, MMD is defined by the idea of representing distances between distributions as distances between mean embeddings of features. That is, say we have distributions $P$ and $Q$ over a set $\X$. The MMD is defined by a feature map $\varphi : \X \to \h$, where $\mathcal H$ is what's called a reproducing kernel Hilbert space. In general, the MMD is $$ \MMD(P, Q) = \lVert \E_{X \sim P}[ \varphi(X) ] - \E_{Y \sim Q}[ \varphi(Y) ] \rVert_\h .$$

As one example, we might have $\X = \h = \R^d$ and $\varphi(x) = x$. In that case: \begin{align} \MMD(P, Q) &= \lVert \E_{X \sim P}[ \varphi(X) ] - \E_{Y \sim Q}[ \varphi(Y) ] \rVert_\h \\&= \lVert \E_{X \sim P}[ X ] - \E_{Y \sim Q}[ Y ] \rVert_{\R^d} \\&= \lVert \mu_P - \mu_Q \rVert_{\R^d} ,\end{align} so this MMD is just the distance between the means of the two distributions. Matching distributions like this will match their means, though they might differ in their variance or in other ways.
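For intuition, here is a minimal numerical sketch of this linear case (NumPy, with made-up Gaussian samples): with $\varphi(x) = x$, the empirical MMD is just the Euclidean distance between the two sample means.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(1000, 3))   # samples from P
Y = rng.normal(0.5, 1.0, size=(1000, 3))   # samples from Q (shifted mean)

# phi(x) = x, so each empirical mean embedding is just a sample mean,
# and the MMD estimate is the distance between the two sample means
mmd_hat = np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))
print(mmd_hat)   # close to ||mu_P - mu_Q|| = 0.5 * sqrt(3) ~ 0.87
```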

Your case is slightly different: we have $\mathcal X = \mathbb R^d$ and $\mathcal H = \mathbb R^p$, with $\varphi(x) = A' x$, where $A$ is a $d \times p$ matrix. So we have \begin{align} \MMD(P, Q) &= \lVert \E_{X \sim P}[ \varphi(X) ] - \E_{Y \sim Q}[ \varphi(Y) ] \rVert_\h \\&= \lVert \E_{X \sim P}[ A' X ] - \E_{Y \sim Q}[ A' Y ] \rVert_{\R^p} \\&= \lVert A' \E_{X \sim P}[ X ] - A' \E_{Y \sim Q}[ Y ] \rVert_{\R^p} \\&= \lVert A'( \mu_P - \mu_Q ) \rVert_{\R^p} .\end{align} This MMD is the difference between two different projections of the mean. If $p < d$ or the mapping $A'$ otherwise isn't invertible, then this MMD is weaker than the previous one: it doesn't distinguish between some distributions that the previous one does.
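As a rough sketch of this projected case (the matrix A below is just a random, made-up adaptation matrix), the MMD only sees whatever part of the mean difference survives the map $x \mapsto A'x$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 3, 2
A = rng.normal(size=(d, p))                  # made-up d x p adaptation matrix
X = rng.normal(0.0, 1.0, size=(1000, d))     # source samples
Y = rng.normal(0.4, 1.0, size=(1000, d))     # target samples

# phi(x) = A'x: project each sample, then compare means of the projections;
# by linearity this equals || A'(mean(X) - mean(Y)) ||, and any mean
# difference lying in the null space of A' is invisible to this distance
mmd_proj = np.linalg.norm((X @ A).mean(axis=0) - (Y @ A).mean(axis=0))
```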

You can also construct stronger distances. For example, if $\X = \R$ and you use $\varphi(x) = (x, x^2)$, then the MMD becomes $\sqrt{(\E X - \E Y)^2 + (\E X^2 - \E Y^2)^2}$, and can distinguish not only distributions with different means but with different variances as well.
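A quick sanity check of that claim, with made-up 1-D samples that have equal means but different variances (the purely linear MMD above would be near zero here, while this one is not):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)   # P: mean 0, variance 1
y = rng.normal(0.0, 2.0, size=100_000)   # Q: mean 0, variance 4

# phi(t) = (t, t^2): the mean embedding collects the first two moments
emb_p = np.array([x.mean(), (x**2).mean()])
emb_q = np.array([y.mean(), (y**2).mean()])
print(np.linalg.norm(emb_p - emb_q))   # roughly |E X^2 - E Y^2| = 3, despite equal means
```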

And you can get much stronger than that: if $\varphi$ maps to a general reproducing kernel Hilbert space, then you can apply the kernel trick to compute the MMD, and it turns out that many kernels, including the Gaussian kernel, lead to the MMD being zero if and only if the distributions are identical.

Specifically, letting $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_\h$, you get \begin{align} \MMD^2(P, Q) &= \lVert \E_{X \sim P} \varphi(X) - \E_{Y \sim Q} \varphi(Y) \rVert_\h^2 \\&= \langle \E_{X \sim P} \varphi(X), \E_{X' \sim P} \varphi(X') \rangle_\h + \langle \E_{Y \sim Q} \varphi(Y), \E_{Y' \sim Q} \varphi(Y') \rangle_\h - 2 \langle \E_{X \sim P} \varphi(X), \E_{Y \sim Q} \varphi(Y) \rangle_\h \\&= \E_{X, X' \sim P} k(X, X') + \E_{Y, Y' \sim Q} k(Y, Y') - 2 \E_{X \sim P, Y \sim Q} k(X, Y) \end{align} which you can straightforwardly estimate with samples.
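In case it's useful, here is a minimal sketch of that plug-in estimate with a Gaussian kernel (the bandwidth sigma and the sample data are arbitrary choices, and this is the simple biased V-statistic version rather than the unbiased estimator):

```python
import numpy as np

def mmd2_gaussian(X, Y, sigma=1.0):
    """Biased (V-statistic) plug-in estimate of MMD^2 with a Gaussian kernel:
    mean k(X, X') + mean k(Y, Y') - 2 * mean k(X, Y)."""
    def gram(A, B):
        # pairwise squared distances, then the Gaussian kernel
        d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

rng = np.random.default_rng(0)
X  = rng.normal(0.0, 1.0, size=(300, 2))
Y  = rng.normal(0.5, 1.5, size=(300, 2))   # different mean and scale
X2 = rng.normal(0.0, 1.0, size=(300, 2))   # fresh draw from the same P
print(mmd2_gaussian(X, Y))    # clearly positive
print(mmd2_gaussian(X, X2))   # close to zero
```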


Update: here's where the "maximum" in the name comes from.

The feature map $\varphi: \X \to \h$ maps into a reproducing kernel Hilbert space. These are spaces of functions, and satisfy a key property (called the reproducing property): $\langle f, \varphi(x) \rangle_\h = f(x)$ for any $f \in \h$.

In the simplest example, $\X = \h = \R^d$ with $\varphi(x) = x$, we view each $f \in \h$ as the function corresponding to some $w \in \R^d$, by $f(x) = w' x$. Then the reproducing property $\langle f, \varphi(x) \rangle_\h = \langle w, x \rangle_{\R^d}$ should make sense.

In more complex settings, like a Gaussian kernel, $f$ is a much more complicated function, but the reproducing property still holds.

Now, we can give an alternative characterization of the MMD: \begin{align} \MMD(P, Q) &= \lVert \E_{X \sim P}[\varphi(X)] - \E_{Y \sim Q}[\varphi(Y)] \rVert_\h \\&= \sup_{f \in \h : \lVert f \rVert_\h \le 1} \langle f, \E_{X \sim P}[\varphi(X)] - \E_{Y \sim Q}[\varphi(Y)] \rangle_\h \\&= \sup_{f \in \h : \lVert f \rVert_\h \le 1} \langle f, \E_{X \sim P}[\varphi(X)] \rangle_\h - \langle f, \E_{Y \sim Q}[\varphi(Y)] \rangle_\h \\&= \sup_{f \in \h : \lVert f \rVert_\h \le 1} \E_{X \sim P}[\langle f, \varphi(X)\rangle_\h] - \E_{Y \sim Q}[\langle f, \varphi(Y) \rangle_\h] \\&= \sup_{f \in \h : \lVert f \rVert_\h \le 1} \E_{X \sim P}[f(X)] - \E_{Y \sim Q}[f(Y)] .\end{align} The second line is a general fact about norms in Hilbert spaces: $\sup_{f : \lVert f \rVert \le 1} \langle f, g \rangle_\h = \lVert g \rVert$ is achieved by $f = g / \lVert g \rVert$. The fourth depends on a technical condition known as Bochner integrability but is true e.g. for bounded kernels or distributions with bounded support. Then at the end we use the reproducing property.

This last line is why it's called the "maximum mean discrepancy" – it's the maximum, over test functions $f$ in the unit ball of $\h$, of the mean difference between the two distributions.
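As a small addendum: the maximizing $f$ (often called the witness function) follows directly from the $f = g / \lVert g \rVert$ argument above together with the reproducing property, so up to normalization $f^\star(t) \propto \E_{X \sim P} k(X, t) - \E_{Y \sim Q} k(Y, t)$, which you can estimate from samples to see where the two distributions differ. A rough sketch with a Gaussian kernel and made-up 1-D data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)    # samples from P
y = rng.normal(1.0, 1.0, size=500)    # samples from Q
sigma = 1.0                           # arbitrary bandwidth

def k(a, t):
    # Gaussian kernel between each sample in a and each evaluation point in t
    return np.exp(-(a[:, None] - t[None, :])**2 / (2 * sigma**2))

t = np.linspace(-4.0, 5.0, 200)
# un-normalized empirical witness: difference of the two mean embeddings;
# it is most positive where P has excess mass and most negative where Q does
witness = k(x, t).mean(axis=0) - k(y, t).mean(axis=0)
```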