I have a vector of numeric values. My hypothesis is that this vector is a mixture drawn from two Gaussian distributions (i.e. k = 2). However, it is possible that only one Gaussian underlies my data (k = 1). I am trying to answer this question in a data-driven manner, but I do not know the best method.
My thought was to compare the two models by calculating the BIC or AIC for each and then performing a likelihood ratio test.
- Should I include k as one of the estimated parameters when I calculate BIC (i.e. {mu1, sd1, mu2, sd2, k} vs. {mu1, sd1, k} for the 2-component and 1-component models, respectively)?
- I'm using the mixtools package in R, and the normalmixEM() function does not seem to allow fitting a 1-component Gaussian (i.e. with k = 1 I get the error "arbmean and arbvar cannot both be FALSE").
- If using a likelihood ratio test with AIC/BIC is not appropriate, is there a more appropriate solution to this problem?
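For the first two points, a minimal sketch may help. A 1-component Gaussian needs no EM at all, since its maximum-likelihood fit is just the sample mean and the 1/n standard deviation. In the usual parameter count, k is fixed for each model rather than estimated, so the 1-component model has 2 free parameters (mu, sd) and the 2-component model has 5 (two means, two standard deviations, one mixing weight). The simulated vector `x` and the helper `bic()` below are illustrative assumptions, not part of the original question.

```r
# Simulated data as a stand-in for the real vector (an assumption).
set.seed(1)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 1))
n <- length(x)

# Hypothetical helper: BIC on the "smaller is better" scale,
# -2 * logLik + (number of free parameters) * log(n).
bic <- function(loglik, n_par, n) -2 * loglik + n_par * log(n)

# One component: the MLE is the sample mean and the 1/n standard deviation.
mu1 <- mean(x)
sd1 <- sqrt(mean((x - mu1)^2))
loglik1 <- sum(dnorm(x, mean = mu1, sd = sd1, log = TRUE))
bic1 <- bic(loglik1, n_par = 2, n = n)  # free parameters: mu, sd

# Two components: mu1, sd1, mu2, sd2 and one mixing weight = 5 free parameters.
# Only runs if mixtools is installed.
if (requireNamespace("mixtools", quietly = TRUE)) {
  fit2 <- mixtools::normalmixEM(x, k = 2)
  bic2 <- bic(fit2$loglik, n_par = 5, n = n)
  print(c(BIC_k1 = bic1, BIC_k2 = bic2))
}
```

On this scale the model with the smaller BIC is preferred. Because EM can converge to a local optimum, it may be worth refitting the 2-component model from several starting values before comparing.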
Edit: I found a somewhat illuminating example here. This approach uses the mclust package to fit 1- vs. 2-component Gaussian mixtures and uses the model log-likelihoods to perform a likelihood ratio test.
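The linked example is not reproduced here, but a comparison in that spirit might look like the sketch below. It assumes mclust is installed and uses simulated data in place of the real vector. The choice of the unequal-variance model ("V") is also an assumption.

```r
library(mclust)  # assumption: mclust is installed

set.seed(1)
x <- c(rnorm(150, 0, 1), rnorm(100, 4, 1))  # simulated stand-in data

# Fit k = 1 and k = 2 with unequal variances ("V") and let mclust compute BIC.
# Note: mclust reports BIC as 2 * logLik - df * log(n), so LARGER is better.
fit <- Mclust(x, G = 1:2, modelNames = "V")
print(fit$BIC)  # BIC for each number of components
print(fit$G)    # number of components selected by BIC
```

mclust also provides `mclustBootstrapLRT()` for a bootstrap version of the likelihood ratio test. That may be preferable to a plain chi-squared reference distribution, which is not reliable when testing the number of mixture components.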
Best Answer
An alternative strategy is to test for normality. If your data come from a single Gaussian, you should usually fail to reject the null hypothesis of normality. Conversely, if you get a statistically significant p-value, then, given your assumption that the data are a mixture of Gaussians, you have evidence that k > 1. This strategy can be generalized to the multivariate case by performing PCA and testing each principal component separately.
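The multivariate generalization could be sketched as follows, using only base R. The simulated matrix `X` is a stand-in for real data, and the Holm adjustment for running several tests is an addition of this sketch rather than part of the strategy as stated.

```r
set.seed(1)
# Simulated two-cluster data in 3 dimensions (a stand-in for real data).
X <- rbind(
  matrix(rnorm(150 * 3), ncol = 3),
  matrix(rnorm(100 * 3, mean = 3), ncol = 3)
)

# Principal component scores, one column per component.
pcs <- prcomp(X, center = TRUE, scale. = FALSE)$x

# Shapiro-Wilk p-value for each component (base R; requires 3 <= n <= 5000).
p_values <- apply(pcs, 2, function(z) shapiro.test(z)$p.value)

# Several tests are run, so adjust for multiple comparisons (Holm here).
print(p.adjust(p_values, method = "holm"))
```

Small adjusted p-values on any component suggest the data are not a single multivariate Gaussian. Normal-looking components, on the other hand, do not prove joint normality.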
Since you're working with R, I recommend you take a look at the nortest package.
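As a rough illustration, assuming nortest is installed and again using simulated data in place of the real vector, two of its tests can be run like this:

```r
library(nortest)  # assumption: nortest is installed

set.seed(1)
x <- c(rnorm(150, 0, 1), rnorm(100, 4, 1))  # simulated stand-in data

ad <- ad.test(x)      # Anderson-Darling test for normality
lf <- lillie.test(x)  # Lilliefors (Kolmogorov-Smirnov) test for normality
print(c(anderson_darling = ad$p.value, lilliefors = lf$p.value))
```

A small p-value indicates a departure from a single Gaussian. It does not by itself say how many components are present, so it answers "k = 1 or not" rather than "k = 2".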