Solved – Number of components for Gaussian mixture model

bic, finite-mixture-model, gaussian mixture distribution, mixture-distribution, r

I have a vector of numeric values. My hypothesis is that this vector is a mixture drawn from two Gaussian distributions (i.e., k = 2). However, it is possible that there is only one Gaussian underlying my data (k = 1). I am attempting to answer this question in a data-driven manner, but I do not know the best method.

My thought was to compare the two models by calculating the BIC or AIC for each, and then performing a likelihood ratio test.

  1. Should I include k as one of the parameters being estimated when I calculate BIC (i.e., {mu1, sd1, mu2, sd2, k} vs. {mu1, sd1, k} for the 2-component and 1-component models, respectively)?

  2. I'm using the mixtools package in R, and the normalmixEM() function does not seem to allow fitting a 1-component Gaussian (i.e., if I use k = 1, I get the error "arbmean and arbvar cannot both be FALSE").

  3. If using a likelihood ratio test with AIC/BIC is not appropriate, is there a more appropriate solution to this problem?
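For concreteness, here is a minimal base-R sketch of the BIC comparison described above. It makes several assumptions: the data are simulated, `fit_two_gaussians` is a hypothetical helper written for this illustration (it is not part of mixtools), and the parameter count uses the common convention that k is fixed within each candidate model rather than estimated. Under that convention the 1-component model has 2 free parameters (mean, sd) and the 2-component model has 5 (two means, two sds, one mixing weight). The 1-component fit needs no EM at all, since its maximum-likelihood estimates are the sample mean and the divide-by-n standard deviation.

```r
# Simulated data for illustration only (not the asker's data).
set.seed(1)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 4, sd = 1))
n <- length(x)

# k = 1: closed-form maximum-likelihood fit, no EM required.
mu1  <- mean(x)
sd1  <- sqrt(mean((x - mu1)^2))
loglik1 <- sum(dnorm(x, mu1, sd1, log = TRUE))

# k = 2: bare-bones EM with unequal means and variances (hypothetical helper).
fit_two_gaussians <- function(x, iters = 200) {
  # Crude starting values: split the data at the median.
  lo <- x[x <= median(x)]
  hi <- x[x >  median(x)]
  p <- 0.5
  m <- c(mean(lo), mean(hi))
  s <- c(sd(lo), sd(hi))
  for (i in seq_len(iters)) {
    # E-step: responsibility of component 1 for each observation.
    d1 <- p * dnorm(x, m[1], s[1])
    d2 <- (1 - p) * dnorm(x, m[2], s[2])
    r  <- d1 / (d1 + d2)
    # M-step: weighted updates of mixing weight, means, and sds.
    p <- mean(r)
    m <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
    s <- sqrt(c(sum(r * (x - m[1])^2) / sum(r),
                sum((1 - r) * (x - m[2])^2) / sum(1 - r)))
  }
  # Log-likelihood evaluated at the final parameter values.
  ll <- sum(log(p * dnorm(x, m[1], s[1]) + (1 - p) * dnorm(x, m[2], s[2])))
  list(p = p, mean = m, sd = s, loglik = ll)
}
fit2 <- fit_two_gaussians(x)

# BIC = -2 * log-likelihood + (free parameters) * log(n); smaller is better.
bic1 <- -2 * loglik1     + 2 * log(n)
bic2 <- -2 * fit2$loglik + 5 * log(n)
print(c(bic_k1 = bic1, bic_k2 = bic2))
```

Some packages report BIC with the opposite sign, so it is worth checking which direction is "better" before comparing values across tools.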

Edit: I found a somewhat illuminating example here. This approach uses the mclust package to fit 1-component and 2-component Gaussian mixtures and then uses the model log-likelihoods to perform a likelihood ratio test.
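The linked example is not reproduced here, so the following is only a sketch of how a 1-vs-2 component likelihood ratio test might look with mclust, not necessarily what that example does. It assumes the mclust package is installed and uses simulated data. It relies on `mclustBootstrapLRT()`, which obtains the p-value by parametric bootstrap; this matters because the usual chi-square reference distribution for the likelihood ratio statistic does not hold when testing the number of mixture components.

```r
library(mclust)  # assumes the mclust package is installed

# Simulated data for illustration only.
set.seed(1)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 4, sd = 1))

# Bootstrap likelihood ratio test of G = 1 against G = 2 components,
# using the univariate unequal-variance model "V".
# maxG = 1 restricts the procedure to the single comparison 1 vs 2.
lrt <- mclustBootstrapLRT(x, modelName = "V", nboot = 199, maxG = 1)
print(lrt)

# Observed statistic and bootstrap p-value for the 1 vs 2 comparison.
lrt$obs
lrt$p.value
```

A larger `nboot` gives a finer-grained p-value at the cost of run time; 199 is chosen here only to keep the sketch quick.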

Best Answer

An alternative strategy is to test for normality. If your data come from a single Gaussian, you should fail to reject the null hypothesis. Conversely, if you get a statistically significant p-value for rejecting the null hypothesis, you have evidence that a single Gaussian does not fit, which under your mixture hypothesis points to k > 1. This strategy can be easily generalized to the multivariate case by performing PCA and testing each principal component separately.
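As a rough sketch of this strategy, the snippet below uses only base R: `shapiro.test()` for the univariate case and `prcomp()` followed by a per-component test for the multivariate case. The data are simulated, and the Holm adjustment is an added assumption (several components mean several tests). Keep in mind that normal-looking principal components do not by themselves establish joint normality, and that failing to reject does not prove k = 1.

```r
set.seed(42)

# Univariate case: simulated bimodal sample, for illustration only.
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 5))
uni <- shapiro.test(x)
print(uni$p.value)

# Multivariate case: two simulated, well-separated 2-D clusters.
X <- rbind(
  cbind(rnorm(100, 0), rnorm(100, 0)),
  cbind(rnorm(100, 5), rnorm(100, 5))
)

# Rotate onto principal components, then test each component separately.
pcs <- prcomp(X, center = TRUE, scale. = FALSE)$x
p_values <- apply(pcs, 2, function(z) shapiro.test(z)$p.value)

# Adjust for having run one test per component.
p_adjusted <- p.adjust(p_values, method = "holm")
print(p_adjusted)
```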

Since you're working with R, I recommend you take a look at the nortest package.
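A short usage sketch, assuming the nortest package is installed and using simulated data: `ad.test()` runs the Anderson-Darling test and `lillie.test()` the Lilliefors (Kolmogorov-Smirnov) test, both with normality as the null hypothesis.

```r
library(nortest)  # assumes the nortest package is installed

# Simulated bimodal sample, for illustration only.
set.seed(7)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 5))

ad  <- ad.test(x)      # Anderson-Darling test for normality
lil <- lillie.test(x)  # Lilliefors test for normality

print(c(anderson_darling = ad$p.value, lilliefors = lil$p.value))
```

Small p-values here indicate that a single Gaussian is a poor description of the sample.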
