From my experiment, with around 600 training samples, each a vector of length 30, the single-Gaussian model unexpectedly performed much better. Having said that, I want to know how or when it is better to use a mixture of Gaussians (any rule of thumb or the like).
Solved – When is a Gaussian mixture model more effective than a single Gaussian
gaussian mixture distribution, machine learning
Related Solutions
The number of data points needed will depend on the dimensionality of the data, the number of mixture components, and whether you use any constraints/priors (e.g. shared covariance matrix, diagonal covariance matrix, regularized covariance matrices, etc.). It will also depend on the data itself, so it's hard to give an exact number.
As a lower bound, consider a Gaussian mixture model with $d$ dimensions and $p$ mixture components, each with a separate covariance matrix. For each component, you'll have to estimate a mean vector ($d$ elements) and covariance matrix ($d \cdot (d+1)/2$ independent elements, because it's a symmetric matrix). You'll also have to estimate the $p$ mixture weights. So, your total number of parameters is $p \cdot (d^2/2 + 3d/2 + 1)$. You will certainly need more data points than parameters. Several times more.
To see the effect of constraints, consider that using a shared covariance matrix will reduce the number of covariance matrix parameters by a factor $p$ (because you only need to estimate a single matrix instead of $p$ of them). Or, a diagonal covariance matrix will have $d$ independent elements instead of $d \cdot (d+1)/2$. Using a low-dimensional mixture model (e.g. mixture of factor analyzers) will similarly reduce the number of parameters. The issue of regularized covariance matrices and other types of priors is more subtle. All of these methods will reduce the number of samples needed by reducing the flexibility of the model.
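The parameter counts above can be sketched in a short function. This is only an illustration: the structure names `full`/`tied`/`diag` mirror scikit-learn's `covariance_type` values purely for familiarity, and the weight count follows the text in including all $p$ weights, although only $p-1$ are independent because they must sum to one.

```python
# Free-parameter count for a d-dimensional, p-component Gaussian mixture.
# Structure names mirror scikit-learn's covariance_type, for familiarity only.
def gmm_params(d, p, structure="full"):
    cov = {
        "full": p * d * (d + 1) // 2,  # one symmetric matrix per component
        "tied": d * (d + 1) // 2,      # a single shared covariance matrix
        "diag": p * d,                 # one diagonal per component
    }[structure]
    means = p * d
    weights = p  # as counted in the text (only p - 1 are independent)
    return means + cov + weights

# With d = 30 and p = 2, the "full" count matches p * (d**2/2 + 3*d/2 + 1).
for s in ("full", "tied", "diag"):
    print(s, gmm_params(30, 2, s))
```

Note how sharing the covariance matrix ("tied") or restricting it to a diagonal ("diag") cuts the dominant $O(d^2)$ term, which is exactly why those constraints reduce the data requirement.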
The danger of using the likelihood of the training data as a goodness-of-fit criterion is overfitting. It's possible that the model could learn to fit random structure in your training set that isn't representative of the underlying distribution from which the data were drawn. In this case, the model could perform poorly on future data drawn from the same distribution. What to do depends on your goal. If you want to compare multiple models to each other, the thing to do is look into 'model selection'.
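One concrete model-selection approach, sketched here with scikit-learn (an assumption; your setup may use different tooling) and synthetic stand-in data: fit each candidate component count on a training split and compare the mean log-likelihood on held-out data, so the comparison is not biased toward the more flexible model.

```python
# Sketch: compare component counts by held-out log-likelihood rather than
# training likelihood, which guards against the overfitting described above.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 30))  # stand-in for the real (600, 30) data

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)
for k in (1, 2, 3):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_train)
    # score() returns the mean per-sample log-likelihood on held-out data
    print(k, gmm.score(X_test))
```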
An alternative strategy is to test for normality. If your data come from a single Gaussian, you should fail to reject the null hypothesis. Conversely, if you get a statistically significant p-value and reject the null hypothesis, then you know that $k > 1$. This strategy generalizes easily to the multivariate case by performing PCA and testing each principal component separately.
Since you're working with R, I recommend you take a look at the nortest package.
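For readers not working in R, here is a rough Python analogue of the same strategy (using scipy's D'Agostino-Pearson test in place of nortest, and synthetic stand-in data):

```python
# Project the data onto its principal components and test each component
# for normality; a significant result suggests a single Gaussian is a poor fit.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 30))  # stand-in for the real (600, 30) data

Xc = X - X.mean(axis=0)         # center the data
# Principal components via SVD of the centered data matrix
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T              # component scores

p_values = [stats.normaltest(scores[:, j]).pvalue
            for j in range(scores.shape[1])]
# If any p-value is significant (after a multiple-testing correction,
# since 30 tests are run here), k > 1 becomes plausible.
print(min(p_values))
```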
Best Answer
In your case a single-kernel model has $D + D(D+1)/2 = 495$ degrees of freedom: $30$ from the mean and $465$ from the independent elements of the symmetric covariance matrix. Thus you have fewer than 2 samples per degree of freedom, and a two-kernel model would have less than one sample per degree of freedom. A rule of thumb is to require at least 3-5, and preferably $>10$, samples per degree of freedom for model estimation.
Your specific case is probably one of overfitting: the mixture model has more degrees of freedom, and thus can better fit the training data, including statistical fluctuations in it. Sometimes this is referred to as "fitting the noise" or just "overfitting". The lower number of degrees of freedom in the single kernel model provided better generalization.
More generally, this is a problem of model selection.
In general, it's better to use a mixture of Gaussians when that kind of distribution can more accurately reflect the true distribution of your data. Of course, you don't actually know the true distribution of the data.
In some cases you can identify whether a single kernel or a mixture of kernels is required by qualitative data analysis. More commonly, criteria like the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) are used to formalize the idea of balancing the goodness of fit to your limited data sample against the degrees of freedom that your model provides.
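A minimal sketch of BIC-based selection, again assuming scikit-learn and using synthetic stand-in data; BIC adds a penalty proportional to the number of free parameters, and the candidate with the lowest BIC is preferred:

```python
# Fit several component counts and compare BIC; lower is better.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 30))  # stand-in for the real (600, 30) data

bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in (1, 2, 3)}
best_k = min(bics, key=bics.get)
print(bics, best_k)
```

On truly single-Gaussian data of this size, the heavy per-parameter penalty makes BIC strongly favor $k = 1$, which matches the overfitting explanation in the answer above.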