Solved – Fitting a Gaussian mixture model using stochastic gradient descent

gaussian-mixture-distribution, gradient-descent, online-algorithms, python

I'm working on an online category learning model that uses stochastic gradient descent to fit a Gaussian mixture model. The model is based on the online learning model used in Toscano & McMurray (2010).

While gradient descent seems to work fairly well for estimating the means and frequencies/mixing probabilities of the categories, I'm having issues with estimating the covariances of the mixture components. The partial derivatives I've been using for the gradient descent update come from Petersen & Pedersen (2008, p. 44).

Starting with

$p(x) = \sum _k \rho_k \mathcal N_x(\mu_k,\Sigma_k)$

Petersen & Pedersen give the partial derivative with respect to the covariance matrix $\Sigma$ as

$\frac{\partial \ln p(x)}{\partial \Sigma_j}=\frac{\rho_j\mathcal N_x(\mu_j,\Sigma_j)}{\sum _k \rho_k \mathcal N_x(\mu_k,\Sigma_k)}\frac{1}{2}[-\Sigma_j^{-1}+\Sigma_j^{-1}(x-\mu_j)(x-\mu_j)^T\Sigma_j^{-1}]$

The gradient descent step for each $\Sigma_j$, as I've implemented it in Python, is (this is a slight simplification; the $\Delta\Sigma$ for every component is calculated before performing the update):

j.sigma += learning_rate*(G(x)/M(x))*0.5*(-inv(j.sigma) + inv(j.sigma).dot((x-j.mu).dot((x-j.mu).transpose())).dot(inv(j.sigma)))

Where j is an object representing the $j$th component of the mixture, and j.mu and j.sigma are that component's mean and covariance matrix. G(x)/M(x) stands in for some code that calculates $\frac{\rho_j\mathcal N_x(\mu_j,\Sigma_j)}{\sum _k \rho_k \mathcal N_x(\mu_k,\Sigma_k)}$.
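For concreteness, here is a minimal sketch of what G(x)/M(x) computes: the posterior responsibility of component $j$ for observation $x$. The `responsibility` function and the `components` list of objects with `.rho`, `.mu`, and `.sigma` attributes are illustrative stand-ins (not my actual code), and `scipy.stats.multivariate_normal` is just one way of evaluating the densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibility(j, x, components):
    """Posterior probability of component j given x (a sketch;
    `components` is an assumed list of objects with .rho, .mu, .sigma)."""
    weighted = [c.rho * multivariate_normal.pdf(x, mean=c.mu, cov=c.sigma)
                for c in components]
    return weighted[j] / sum(weighted)
```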

So, I'm wondering whether there's something wrong with my code (highly likely), or whether this is just a really bad way to fit this kind of model when dealing with data with more than two dimensions (see Toscano & McMurray for algorithms for univariate and bivariate data that definitely work).
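One way to check whether the analytic gradient above (and an implementation of it) is correct is to compare it against central finite differences of $\ln p(x)$. The sketch below uses made-up toy parameters, purely for the check, and writes out the Gaussian density by hand (rather than calling a library routine) so that individual entries of $\Sigma_j$ can be perturbed:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    # density written out by hand so finite differences can perturb
    # single entries of sigma (a library pdf may enforce symmetry)
    d = x - mu
    k = len(x)
    return np.exp(-0.5 * d.dot(np.linalg.inv(sigma)).dot(d)) / \
        np.sqrt((2.0 * np.pi) ** k * np.linalg.det(sigma))

def mixture_pdf(x, rhos, mus, sigmas):
    return sum(r * gauss_pdf(x, m, s) for r, m, s in zip(rhos, mus, sigmas))

# made-up toy mixture, purely for the check
rhos = [0.4, 0.6]
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
sigmas = [np.eye(2), np.array([[1.5, 0.3], [0.3, 0.8]])]
x = np.array([1.0, -0.5])

# analytic gradient of ln p(x) with respect to Sigma_0
post = rhos[0] * gauss_pdf(x, mus[0], sigmas[0]) / mixture_pdf(x, rhos, mus, sigmas)
sigmaInv = np.linalg.inv(sigmas[0])
d = x - mus[0]
analytic = post * 0.5 * (-sigmaInv + sigmaInv.dot(np.outer(d, d)).dot(sigmaInv))

# central finite differences, one entry of Sigma_0 at a time
eps = 1e-6
numeric = np.zeros((2, 2))
for a in range(2):
    for b in range(2):
        sp, sm = sigmas[0].copy(), sigmas[0].copy()
        sp[a, b] += eps
        sm[a, b] -= eps
        numeric[a, b] = (np.log(mixture_pdf(x, rhos, mus, [sp, sigmas[1]])) -
                         np.log(mixture_pdf(x, rhos, mus, [sm, sigmas[1]]))) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-5)
```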

references:
Toscano, J. C., & McMurray, B. (2010). Cue integration with categories: Weighting acoustic cues in speech using unsupervised learning and distributional statistics. Cognitive Science, 34, 434-464.

Petersen, K. B., & Pedersen, M. S. (2008). The Matrix Cookbook. Version: November 14, 2008.

Best Answer

Assuming that mus[d] is $\mu_j$, j.sigma is $\Sigma_j$, and G(x)/M(x) indeed computes the posterior probability of component $j$ given the data $x$, $$p(j \mid x) = \frac{\rho_j \mathcal{N}_x(\mu_j, \Sigma_j)}{\sum_k \rho_k \mathcal{N}_x(\mu_k, \Sigma_k)},$$ the gradient itself seems correct to me. But here are some things that I noticed that might help you to find your problem:

  • I would expect the access to the mean, the covariance, and the calculation of the posterior to all involve either j or d, whichever variable represents the component for which you want to compute the gradient in your code. If you tell us what j and d stand for, we might be able to tell you more.
  • If G(x)/M(x) accesses j.Sigma to compute the posterior, your code might not compute what you think it does. It might be better to first compute all the gradients of all the parameters, and then perform the update.
  • Stochastic gradient descent is usually not the first choice to optimize mixtures of Gaussians. Most often, expectation maximization (EM) is used (see, for example, Bishop, 2007). Even if you don't use EM, you might want to consider BFGS or L-BFGS (implemented in scipy.optimize) before using SGD. And even if you stick to SGD, you should consider using multiple data points ("batches") at a time to estimate the gradient, or at least including a momentum term. Briefly looking at Toscano and McMurray's paper, my guess is that they chose to use SGD because they were interested in modeling speech acquisition in a biologically more plausible way, rather than obtaining the best possible fit, and doing this online (that is, one data point at a time). If you don't need this, my advice would be to use EM.

    (I just realized you specifically asked for online learning, so the only viable option for you might be to add the momentum term to speed things up a bit.)

  • The way you chose to compute the gradient is quite inefficient, which will further slow down learning. You might not have seen reasonable results because it takes forever before the algorithm converges to something interesting. Here is a slightly better way to calculate the gradient:

    sigmaInv = inv(j.sigma)                    # invert once, reuse below
    v = sigmaInv.dot(x - mus[d])               # Sigma_j^{-1} (x - mu_j)
    dSigma = G(x)/M(x) * 0.5 * (-sigmaInv + numpy.outer(v, v))
    

    There are still ways to further improve the computation of the gradient. For example, we still get a valid ascent direction (although not the steepest ascent direction) if we multiply the gradient by a positive definite matrix (such as $\Sigma_j$, which would simplify the gradient a bit). It might also work better to use a different parametrization of the covariance, such as its Cholesky factors, and compute the gradients with respect to those instead.
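To make the first of those points concrete: pre- and post-multiplying the gradient by $\Sigma_j$ (a positive definite matrix) cancels the inverses entirely, since $\Sigma_j[-\Sigma_j^{-1}+\Sigma_j^{-1}(x-\mu_j)(x-\mu_j)^T\Sigma_j^{-1}]\Sigma_j = (x-\mu_j)(x-\mu_j)^T - \Sigma_j$. Here is a sketch of the resulting update; the names (`r` for the posterior responsibility that G(x)/M(x) computes, `learning_rate`, etc.) are illustrative, not your actual code:

```python
import numpy as np

def sigma_update(sigma, mu, x, r, learning_rate):
    """One preconditioned SGD step on a component covariance.

    The gradient has been pre- and post-multiplied by sigma, so no
    matrix inverse is needed at all; r is the posterior responsibility.
    """
    d = x - mu
    return sigma + learning_rate * r * 0.5 * (np.outer(d, d) - sigma)
```

This is no longer the steepest-ascent direction, but it is still an ascent direction, and the update conveniently keeps $\Sigma_j$ symmetric.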