Solved – Distance between two Gaussian mixtures to evaluate cluster solutions

clustering · gaussian mixture distribution · kullback-leibler

I'm running a quick simulation to compare different clustering methods, and I've hit a snag trying to evaluate the cluster solutions.

I know of various validation metrics (many of them available via cluster.stats() in R), but I assume those are best used when the estimated number of clusters equals the true number of clusters. I want to keep the ability to measure how well a clustering solution performs when it does not specify the correct number of clusters from the original simulation (i.e., how well does a three-cluster solution model data that were simulated to have four clusters?). For what it's worth, the clusters are simulated to have identical covariance matrices.

I thought the KL divergence between two mixtures of Gaussians would be useful to implement, but no closed-form solution exists (Hershey and Olsen, 2007), and a Monte Carlo approximation is starting to get computationally expensive.

Are there any other solutions that might be easy to implement (even if just an approximation)?

Best Answer

Suppose we have two Gaussian mixtures in $\mathbb R^d$:$\DeclareMathOperator{\N}{\mathcal N} \newcommand{\ud}{\mathrm{d}} \DeclareMathOperator{\E}{\mathbb E} \DeclareMathOperator{\MMD}{\mathrm{MMD}}$ $$ P = \sum_{i=1}^{n} \alpha_i P_i = \sum_{i=1}^n \alpha_i \N(\mu_i, \Sigma_i) \qquad Q = \sum_{j=1}^m \beta_j Q_j = \sum_{j=1}^m \beta_j \N(m_j, S_j) .$$ Call their densities $p(\cdot)$ and $q(\cdot)$, respectively, and denote the densities of their components $P_i$, $Q_j$ by $p_i(x) = \N(x; \mu_i, \Sigma_i)$, $q_j(x) = \N(x; m_j, S_j)$.

The following distances are available in closed form (minimal R sketches of each are given after the list):

  • $L_2$ distance, as suggested in a comment by user39665. This is: \begin{align} L_2(P, Q)^2 &= \int (p(x) - q(x))^2 \,\ud x \\&= \int \left( \sum_{i} \alpha_i p_i(x) - \sum_j \beta_j q_j(x) \right)^2 \ud x \\&= \sum_{i,i'} \alpha_i \alpha_{i'} \int p_i(x) p_{i'}(x) \ud x + \sum_{j,j'} \beta_j \beta_{j'} \int q_j(x) q_{j'}(x) \ud x \\&\qquad - 2 \sum_{i,j} \alpha_i \beta_j \int p_i(x) q_j(x) \ud x .\end{align} Note that, as seen for example in section 8.1.8 of the matrix cookbook: \begin{align} \int \N(x; \mu, \Sigma) \N(x; \mu', \Sigma') \,\ud x &= \N(\mu; \mu', \Sigma + \Sigma') \end{align} so this can be evaluated easily in $O(m n)$ time.

  • The maximum mean discrepancy (MMD) with a Gaussian RBF kernel. This is a cool distance, not yet super-well-known among the statistics community, that takes a bit of math to define.

    Letting $$k(x, y) := \exp\left( - \frac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right),$$ define the Hilbert space $\mathcal{H}$ as the reproducing kernel Hilbert space corresponding to $k$: $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal H}$.

    Define the mean map kernel as $$ K(P, Q) = \E_{X \sim P, Y \sim Q} k(X, Y) = \langle \E_{X \sim P} \varphi(X), \E_{Y \sim Q} \varphi(Y) \rangle .$$

    The MMD is then \begin{align} \MMD(P, Q) &= \lVert \E_{X \sim P}[\varphi(X)] - \E_{Y \sim Q}[\varphi(Y)] \rVert \\&= \sqrt{K(P, P) + K(Q, Q) - 2 K(P, Q)} \\&= \sup_{f : \lVert f \rVert_{\mathcal H} \le 1} \E_{X \sim P} f(X) - \E_{Y \sim Q} f(Y) .\end{align}

    For our mixtures $P$ and $Q$, note that $$ K(P, Q) = \sum_{i, j} \alpha_i \beta_j K(P_i, Q_j) $$ and similarly for $K(P, P)$ and $K(Q, Q)$.

    It turns out, using similar tricks as for $L_2$, that $K(\N(\mu, \Sigma), \N(\mu', \Sigma'))$ is $$ (2 \pi \sigma^2)^{d/2} \N(\mu; \mu', \Sigma + \Sigma' + \sigma^2 I) .$$

    As $\sigma \to 0$, $\MMD(P, Q)^2$ becomes $(2 \pi \sigma^2)^{d/2}$ times the squared $L_2$ distance, so after rescaling the MMD recovers the $L_2$ distance in that limit. You'd normally want to use a different $\sigma$, though, one on the scale of the data variation.

    Closed forms are also available for polynomial kernels $k$ in the MMD; see

    Muandet, Fukumizu, Dinuzzo, and Schölkopf (2012). Learning from Distributions via Support Measure Machines. In Advances in Neural Information Processing Systems. arXiv:1202.6504.

    For a lot of nice properties of this distance, see

    Sriperumbudur, Gretton, Fukumizu, Schölkopf, and Lanckriet (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517–1561. arXiv:0907.5309.

  • Quadratic Jensen-Rényi divergence. The Rényi-$\alpha$ entropy is defined as $$ H_\alpha(p) = \frac{1}{1-\alpha} \log\left( \int p(x)^\alpha \,\ud x \right) .$$ Its limit as $\alpha \to 1$ is the Shannon entropy. The Jensen-Rényi divergence is $$ \mathrm{JR}_\alpha(p, q) = H_\alpha\left( \frac{p + q}{2} \right) - \frac{H_\alpha(p) + H_\alpha(q)}{2} $$ where $\frac{p + q}{2}$ denotes an equal mixture between $p$ and $q$. It turns out that, when $\alpha = 2$ and when $P$ and $Q$ are Gaussian mixtures (as here), you can compute a closed form for $\mathrm{JR}_2$. This was done by

    Wang, Syeda-Mahmood, Vemuri, Beymer, and Rangarajan (2009). Closed-Form Jensen-Renyi Divergence for Mixture of Gaussians and Applications to Group-Wise Shape Registration. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 12(1), 648–655.
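To make the $L_2$ computation above concrete, here is a minimal R sketch (mine, not from the original answer; the names `overlap_matrix`, `mus1`, `Sigmas1`, etc. are placeholders). It needs only `mvtnorm::dmvnorm()` plus the Gaussian product identity quoted from the matrix cookbook.

```r
## Minimal sketch: closed-form squared L2 distance between two Gaussian mixtures.
## A mixture is given by a weight vector plus lists of mean vectors and covariance
## matrices. Key identity: int N(x; mu, Sigma) N(x; mu', Sigma') dx = N(mu; mu', Sigma + Sigma').
library(mvtnorm)  # for dmvnorm()

# O[i, j] = integral of (i-th component of mixture 1) * (j-th component of mixture 2)
overlap_matrix <- function(mus1, Sigmas1, mus2, Sigmas2) {
  O <- matrix(0, length(mus1), length(mus2))
  for (i in seq_along(mus1)) for (j in seq_along(mus2)) {
    O[i, j] <- dmvnorm(mus1[[i]], mean = mus2[[j]], sigma = Sigmas1[[i]] + Sigmas2[[j]])
  }
  O
}

l2_squared <- function(w1, mus1, Sigmas1, w2, mus2, Sigmas2) {
  App <- overlap_matrix(mus1, Sigmas1, mus1, Sigmas1)
  Aqq <- overlap_matrix(mus2, Sigmas2, mus2, Sigmas2)
  Apq <- overlap_matrix(mus1, Sigmas1, mus2, Sigmas2)
  drop(t(w1) %*% App %*% w1 + t(w2) %*% Aqq %*% w2 - 2 * t(w1) %*% Apq %*% w2)
}

# Example: a two-component 2-d mixture vs. a single Gaussian, identical covariances
# w1 <- c(0.5, 0.5); mus1 <- list(c(0, 0), c(3, 0)); Sigmas1 <- list(diag(2), diag(2))
# w2 <- 1;           mus2 <- list(c(1, 0));          Sigmas2 <- list(diag(2))
# l2_squared(w1, mus1, Sigmas1, w2, mus2, Sigmas2)
```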
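Along the same lines, a sketch of the Gaussian-RBF MMD that reuses `overlap_matrix()` from the previous block; the bandwidth argument `sigma` (and its default value) is my own choice of interface, and as noted above it should be set to the scale of the data variation.

```r
## Sketch: MMD between the two mixtures under a Gaussian RBF kernel with bandwidth sigma.
## Adding sigma^2 I to one side's covariances before the overlap integral gives
## N(mu; mu', Sigma + Sigma' + sigma^2 I), matching the closed form above.
mmd <- function(w1, mus1, Sigmas1, w2, mus2, Sigmas2, sigma = 1) {
  d <- length(mus1[[1]])
  const <- (2 * pi * sigma^2)^(d / 2)
  pad <- function(Sigmas) lapply(Sigmas, function(S) S + sigma^2 * diag(d))
  Kpp <- const * overlap_matrix(mus1, pad(Sigmas1), mus1, Sigmas1)
  Kqq <- const * overlap_matrix(mus2, pad(Sigmas2), mus2, Sigmas2)
  Kpq <- const * overlap_matrix(mus1, pad(Sigmas1), mus2, Sigmas2)
  mmd2 <- drop(t(w1) %*% Kpp %*% w1 + t(w2) %*% Kqq %*% w2 - 2 * t(w1) %*% Kpq %*% w2)
  sqrt(max(mmd2, 0))  # guard against tiny negative values from floating-point error
}
```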
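Finally, for the quadratic Jensen-Rényi divergence: $\int p(x)^2 \,\ud x = \sum_{i,i'} \alpha_i \alpha_{i'} \N(\mu_i; \mu_{i'}, \Sigma_i + \Sigma_{i'})$ by the same product identity, so $H_2$ of any Gaussian mixture (including the pooled mixture $\frac{p+q}{2}$) reduces to the pairwise overlap integrals already computed. A sketch along those lines, again reusing `overlap_matrix()` (see Wang et al. (2009) for the full derivation):

```r
## Sketch: quadratic Jensen-Renyi divergence JR_2 between the two mixtures.
## H_2(p) = -log( integral of p(x)^2 dx ), and each integral below is a weighted
## sum of pairwise Gaussian overlap integrals.
jr2 <- function(w1, mus1, Sigmas1, w2, mus2, Sigmas2) {
  App <- overlap_matrix(mus1, Sigmas1, mus1, Sigmas1)
  Aqq <- overlap_matrix(mus2, Sigmas2, mus2, Sigmas2)
  Apq <- overlap_matrix(mus1, Sigmas1, mus2, Sigmas2)
  int_pp <- drop(t(w1) %*% App %*% w1)  # integral of p^2
  int_qq <- drop(t(w2) %*% Aqq %*% w2)  # integral of q^2
  int_pq <- drop(t(w1) %*% Apq %*% w2)  # integral of p * q
  H2 <- function(z) -log(z)             # quadratic Renyi entropy from integral of squared density
  H2((int_pp + 2 * int_pq + int_qq) / 4) - (H2(int_pp) + H2(int_qq)) / 2
}
```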