Solved – Gaussian mixture regression in higher dimensions

curve fitting, gaussian mixture distribution, least squares, nonlinear regression, optimization

Problem:

I have a discrete representation of a surface/height-map $z = f(x,y)$ that I want to model as a mixture of Gaussians (please set probability distributions aside for a moment). More precisely,

$ z = f(x) = \sum\limits_{i=1}^{n_g} a_i \; g(x; c_i, \Sigma_i)$

where

  • $x \in R^d$
  • $g(x; c_i, \Sigma_i) = \exp\{-\frac{(x-c_i)^T\Sigma_i^{-1}(x-c_i)}{2} \}$,
  • $a_i \geq 0$ is the max-height of the $i$-th Gaussian component,
  • $c_i \in R^d$ is the center of the $i$-th Gaussian component, and
  • $\Sigma_i$ is the covariance of the $i$-th Gaussian component that determines its support in space and its orientation.

Given the surface data, I want to fit this model, i.e. estimate the parameters $[\ldots, a_i, c_i, \Sigma_i, \ldots]$ from some reasonable initial guesses, by minimizing the least-squares objective function below:

$E(a,c,\Sigma) = \sum\limits_{j=1}^N \left(z_j - f(x_j; a, c, \Sigma)\right)^2$

Constraints:

  • $a_i \geq 0$,
  • $\Sigma_i$ should be symmetric ($\Sigma_i = \Sigma_i^T$) and positive-definite ($u^T\Sigma_i u > 0 \;\; \forall u \in R^d,\; u \neq 0$).

I would like to emphasize that I want to optimize for the full covariance matrices $\Sigma_i$ (restricting them to be diagonal won't suffice).
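To make the model concrete, here is a minimal NumPy sketch of the forward model $f$ and the residual vector that $E$ sums and squares (the function and variable names are my own, not from any particular library):

```python
import numpy as np

def gmm_surface(X, a, c, Sigma):
    """Evaluate z = sum_i a_i * exp(-0.5 (x - c_i)^T Sigma_i^{-1} (x - c_i)).

    X     : (N, d) array of query points
    a     : (n_g,) array of component heights
    c     : (n_g, d) array of component centers
    Sigma : (n_g, d, d) array of component covariances
    """
    z = np.zeros(X.shape[0])
    for a_i, c_i, S_i in zip(a, c, Sigma):
        diff = X - c_i                         # (N, d)
        # Solve S_i y = (x_j - c_i) instead of forming the explicit inverse
        sol = np.linalg.solve(S_i, diff.T).T   # (N, d), rows = Sigma_i^{-1}(x_j - c_i)
        mahal = np.einsum('nd,nd->n', diff, sol)
        z += a_i * np.exp(-0.5 * mahal)
    return z

def residuals(z, X, a, c, Sigma):
    """Residual vector r_j = z_j - f(x_j); the objective E is sum(r**2)."""
    return z - gmm_surface(X, a, c, Sigma)
```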

Approach so far

Below I describe the approach I have been pursuing so far, but I would really appreciate any other ideas for fitting this model to data, or at least pointers to the right literature (Googling mostly turns up literature on fitting a mixture model to a probability distribution using EM). I might have a huge number of components in the mixture, so it would be nice if you could supplement your answers with comments on computational complexity, the quality of the solution (how far from the global minimum), and sensitivity to initialization (how close the initial guesses need to be).

My initial idea is to use iterative minimization algorithms such as the Gauss-Newton algorithm or the Levenberg–Marquardt algorithm.
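For illustration, an unconstrained Levenberg-Marquardt fit could look roughly like the sketch below, using `scipy.optimize.least_squares` and the `residuals` helper from the sketch above (the parameter packing is hypothetical, and nothing here yet enforces symmetry or positive definiteness of $\Sigma_i$):

```python
import numpy as np
from scipy.optimize import least_squares

def fit_lm(z, X, a0, c0, Sigma0):
    """Unconstrained Levenberg-Marquardt fit; each Sigma_i is treated as a
    free d x d matrix here (symmetry / positive definiteness handled later)."""
    n_g, d = c0.shape

    def unpack(theta):
        a = theta[:n_g]
        c = theta[n_g:n_g * (1 + d)].reshape(n_g, d)
        Sigma = theta[n_g * (1 + d):].reshape(n_g, d, d)
        return a, c, Sigma

    def res(theta):
        return residuals(z, X, *unpack(theta))   # residuals() from the sketch above

    theta0 = np.concatenate([a0, c0.ravel(), Sigma0.ravel()])
    # 'lm' needs at least as many residuals as parameters; use 'trf' otherwise
    sol = least_squares(res, theta0, method='lm')
    return unpack(sol.x)
```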

So I started off by deriving the derivatives of the error/objective function $E(a,c,\Sigma)$ with respect to each of its parameters:

$\begin{eqnarray}
\frac{\partial E}{\partial a_i} & = & -2\sum\limits_{j=1}^N \left(z_j - f(x_j; a, c, \Sigma)\right) \frac{\partial f(x_j; a, c, \Sigma)}{\partial a_i} \\
& = & -2\sum\limits_{j=1}^N \left(z_j - f(x_j; a, c, \Sigma)\right) g(x_j; c_i, \Sigma_i)
\end{eqnarray}$

$\begin{eqnarray}
\frac{\partial E}{\partial c_i} & = & -2\sum\limits_{j=1}^N \left(z_j - f(x_j; a, c, \Sigma)\right) \frac{\partial f(x_j; a, c, \Sigma)}{\partial c_i} \\
& = & -2\sum\limits_{j=1}^N \left(z_j - f(x_j; a, c, \Sigma)\right) \; a_i \frac{\partial g(x_j; c_i, \Sigma_i)}{\partial c_i} \\
& = & -2 \sum\limits_{j=1}^N \left(z_j - f(x_j; a, c, \Sigma)\right) \; a_i \; g(x_j; c_i, \Sigma_i) \; \Sigma_i^{-1} (x_j - c_i)
\end{eqnarray}$

$\begin{eqnarray}
\frac{\partial E}{\partial \Sigma_i} & = & -2\sum\limits_{j=1}^N \left(z_j - f(x_j; a, c, \Sigma)\right) \frac{\partial f(x_j; a, c, \Sigma)}{\partial \Sigma_i} \\
& = & -2\sum\limits_{j=1}^N \left(z_j - f(x_j; a, c, \Sigma)\right) \; a_i \frac{\partial g(x_j; c_i, \Sigma_i)}{\partial \Sigma_i} \\
& = & -2 \sum\limits_{j=1}^N \left(z_j - f(x_j; a, c, \Sigma)\right) \; a_i \; g(x_j; c_i, \Sigma_i) \; \frac{\Sigma_i^{-T} (x_j - c_i)(x_j - c_i)^T\Sigma_i^{-T}}{2}
\end{eqnarray}$
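For reference, these expressions translate almost line by line into NumPy. The sketch below reuses `residuals` from the model sketch above and treats each $\Sigma_i$ as symmetric, so $\Sigma_i^{-T}$ simplifies to $\Sigma_i^{-1}$:

```python
import numpy as np

def gradients(z, X, a, c, Sigma):
    """Gradients of E = sum_j (z_j - f(x_j))^2 w.r.t. a_i, c_i, Sigma_i,
    mirroring the expressions derived above (Sigma_i assumed symmetric)."""
    n_g, d = c.shape
    r = residuals(z, X, a, c, Sigma)                  # (N,)  z_j - f(x_j)
    dE_da = np.zeros(n_g)
    dE_dc = np.zeros((n_g, d))
    dE_dS = np.zeros((n_g, d, d))
    for i in range(n_g):
        diff = X - c[i]                               # (N, d)
        v = np.linalg.solve(Sigma[i], diff.T).T       # (N, d)  Sigma_i^{-1}(x_j - c_i)
        g = np.exp(-0.5 * np.einsum('nd,nd->n', diff, v))   # (N,)
        dE_da[i] = -2.0 * np.sum(r * g)
        dE_dc[i] = -2.0 * a[i] * np.einsum('n,nd->d', r * g, v)
        # the factor 1/2 in the matrix derivative cancels one factor of 2
        dE_dS[i] = -1.0 * a[i] * np.einsum('n,nd,ne->de', r * g, v, v)
    return dE_da, dE_dc, dE_dS
```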

Now, while minimizing the error function $E(a,c,\Sigma)$ with one of these iterative algorithms, I want to enforce that the covariance matrices $\Sigma_i$ remain symmetric and positive definite, and I don't know how to enforce that constraint. (Should I optimize only the elements on one side of the diagonal? That would project onto the set of symmetric matrices, but would not guarantee positive definiteness.) I am guessing this is an already solved problem, at least reasonably well. It would be great if you could point me to the literature for it so I can move forward.

Alternatively, coming from a computer-vision and machine-learning background, I can't help thinking about applying the Expectation-Maximization algorithm. Intuitively, if I knew which Gaussian component each $(x,y,z)$ location belongs to (in a soft sense), then I could compute the parameters of the components. Conversely, if I knew the parameters of the mixture, I could compute the degree to which each $(x,y,z)$ location belongs to each component. And Expectation-Maximization, in the end, boils down to minimizing an objective function by coordinate descent. So it would be nice if you could point me to any literature that takes the EM approach to this problem.

Best Answer

Your idea is exactly what I would recommend. To enforce the positive definiteness of the covariance matrices, just use the Cholesky decomposition $\Sigma_i = L_iL_i^T$, and minimise the objective with respect to $L_i$ instead of $\Sigma_i$. I refer to this article for more details. Also, you may try to optimise the log-likelihood instead of the mean square error.
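For concreteness, here is a minimal sketch of that reparameterisation (the helper names are mine; `residuals` refers to the model sketch in the question, and `scipy.optimize.least_squares` with the `trf` method also lets you keep $a_i \geq 0$ via bounds):

```python
import numpy as np
from scipy.optimize import least_squares

def chol_to_cov(L_flat, d):
    """Rebuild Sigma = L L^T from the d(d+1)/2 free entries of a
    lower-triangular L; the result is symmetric positive semi-definite by
    construction (strictly positive definite when diag(L) is nonzero)."""
    L = np.zeros((d, d))
    L[np.tril_indices(d)] = L_flat
    return L @ L.T

def fit_cholesky(z, X, a0, c0, L0_flat):
    """Least-squares fit over (a_i, c_i, L_i) instead of (a_i, c_i, Sigma_i)."""
    n_g, d = c0.shape
    m = d * (d + 1) // 2

    def unpack(theta):
        a = theta[:n_g]
        c = theta[n_g:n_g + n_g * d].reshape(n_g, d)
        Ls = theta[n_g + n_g * d:].reshape(n_g, m)
        Sigma = np.stack([chol_to_cov(Lf, d) for Lf in Ls])
        return a, c, Sigma

    def res(theta):
        a, c, Sigma = unpack(theta)
        return residuals(z, X, a, c, Sigma)   # from the question's model sketch

    theta0 = np.concatenate([a0, c0.ravel(), L0_flat.ravel()])
    # lower bound of 0 on the first n_g entries enforces a_i >= 0
    lb = np.concatenate([np.zeros(n_g), np.full(theta0.size - n_g, -np.inf)])
    sol = least_squares(res, theta0, bounds=(lb, np.inf), method='trf')
    return unpack(sol.x)
```

Since $L_iL_i^T$ is positive semi-definite for any real lower-triangular $L_i$, no explicit constraint on $\Sigma_i$ is needed during the iterations.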
