Solved – How to know if the Gaussian mixture model has enough training data

gaussian mixture distribution, machine learning

A somewhat soft question: I'm training a Gaussian mixture model (with the EM algorithm) on data of size $N$ ($N$ is typically between 4 and 64).
How many samples do I need? Obviously it depends on the data, but is there a rule of thumb?
Also, is there a method to assess whether the model is trained properly? I can look at the log-likelihood and check that it converges, but what about its value? Can that serve as some kind of estimate of how successful the training was?

Thanks.

Best Answer

The number of data points needed will depend on the dimensionality of the data, the number of mixture components, and whether you use any constraints/priors (e.g. shared covariance matrix, diagonal covariance matrix, regularized covariance matrices, etc.). It will also depend on the data itself, so it's hard to give an exact number.

As a lower bound, consider a Gaussian mixture model with $d$ dimensions and $p$ mixture components, each with a separate covariance matrix. For each component, you'll have to estimate a mean vector ($d$ elements) and a covariance matrix ($d \cdot (d+1)/2$ independent elements, because it's symmetric). You'll also have to estimate the $p$ mixture weights (strictly, only $p-1$ of them are free, because they must sum to one). So, your total number of parameters is roughly $p \cdot (d^2/2 + 3d/2 + 1)$. You will certainly need more data points than parameters; several times more.
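To make the bookkeeping concrete, here is a minimal Python sketch (the function name `count_gmm_parameters` is just illustrative) that evaluates this count:

```python
def count_gmm_parameters(d, p):
    """Free parameters of a GMM with p components in d dimensions,
    each with its own full covariance matrix."""
    means = p * d                        # one mean vector per component
    covariances = p * d * (d + 1) // 2   # symmetric matrix: d(d+1)/2 each
    weights = p - 1                      # weights sum to 1, so p-1 are free
    return means + covariances + weights

# Example: 3 components in 5 dimensions -> 62 parameters,
# so N between 4 and 64 samples is far too few.
print(count_gmm_parameters(d=5, p=3))
```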

To see the effect of constraints, consider that using a shared covariance matrix will reduce the number of covariance matrix parameters by a factor $p$ (because you only need to estimate a single matrix instead of $p$ of them). Or, a diagonal covariance matrix will have $d$ independent elements instead of $d \cdot (d+1)/2$. Using a low-dimensional mixture model (e.g. mixture of factor analyzers) will similarly reduce the number of parameters. The issue of regularized covariance matrices and other types of priors is more subtle. All of these methods will reduce the number of samples needed by reducing the flexibility of the model.
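As a sketch of how these constraints change the count (plain arithmetic again; the helper `covariance_parameters` is hypothetical):

```python
def covariance_parameters(d, p, kind="full"):
    """Free covariance parameters under the constraints discussed above."""
    if kind == "full":   # separate symmetric d x d matrix per component
        return p * d * (d + 1) // 2
    if kind == "tied":   # one matrix shared by all p components
        return d * (d + 1) // 2
    if kind == "diag":   # d variances per component, no off-diagonals
        return p * d
    raise ValueError(kind)

# For d=5, p=3: full -> 45, tied -> 15, diag -> 15.
for kind in ["full", "tied", "diag"]:
    print(kind, covariance_parameters(d=5, p=3, kind=kind))
```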

The danger of using the likelihood of the training data as a goodness-of-fit criterion is overfitting: the model may fit random structure in your training set that isn't representative of the underlying distribution from which the data were drawn, and then perform poorly on future data from that same distribution. What to do depends on your goal. If you want to compare multiple models to each other, the thing to do is look into 'model selection', e.g. held-out log-likelihood or penalized criteria such as BIC.
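As a starting point, here is a minimal sketch assuming scikit-learn's `GaussianMixture` is available; it compares candidate component counts on held-out log-likelihood and BIC rather than on training likelihood alone:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian clusters in 2-D.
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# score() returns the average log-likelihood per sample; higher is better
# on held-out data. For BIC, lower is better.
for p in [1, 2, 3, 4]:
    gmm = GaussianMixture(n_components=p, random_state=0).fit(X_train)
    print(p, gmm.score(X_test), gmm.bic(X_train))
```

With clusters as well separated as these, both criteria should favor the two-component model; on your own data the two can disagree, which is itself informative about how much the extra components are buying you.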