Solved – Testing Data for Unimodality or Bimodality Using MATLAB

aic, distributions, gaussian mixture distribution, kernel-smoothing, MATLAB

I am trying to figure out what I did wrong or what I could do to get accurate results.

I have n vectors of data, and I am trying to decide whether each dataset is unimodal or bimodal. I assumed that it could be a mixture of Gaussians, so in MATLAB I attempted:

fit1 = fitgmdist(data(:,x), 1)  % x is the index for each vector; 1-component fit
fit2 = fitgmdist(data(:,x), 2)  % 2-component fit (gmdistribution.fit in older releases)

I used the minimum AIC of the two fits to decide whether each dataset is unimodal or bimodal. The results come back bimodal for every dataset, but they are not all bimodal, and looking at the histograms of the data, most of them do not appear bimodal either. My question is: why am I getting inaccurate, false-positive results? Is there a better method? Did I do something wrong?
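The decision rule I am applying is essentially the following (a minimal sketch; modality is just an illustrative name for where I store the result):

% pick the model with the smaller AIC
if fit1.AIC <= fit2.AIC
    modality(x) = 1;   % classify dataset x as unimodal
else
    modality(x) = 2;   % classify dataset x as bimodal
end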

SOME EDITS:

Using the gmdistribution function with the means and variances from fit1 and fit2, I created random samples using:

Y  = random(obj,  length(original_data));   % obj contains the unimodal distribution
Y2 = random(obj2, length(original_data));   % obj2 contains the bimodal distribution
hist(Y,  length(original_data))
hist(Y2, length(original_data))

I compared each to the original data vectors, and the bimodal fit seems to produce samples that most resemble the original data. Is this a proper way of checking the AIC result or the dip test? I also ran Hartigan's dip test in MATLAB, and the p-values were close to 0 (less than 0.05), so I assumed this also indicates bimodality. I checked the kurtosis as well, but it did not flag all of the datasets as bimodal. Any input or direction?
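For reference, the dip test call looks roughly like this (I am assuming the commonly circulated File Exchange implementation HartigansDipSignifTest; the exact function name, arguments, and outputs may differ in your copy):

nboot = 500;                                                % bootstrap samples for the p-value
[dip, p_value] = HartigansDipSignifTest(data(:,x), nboot);  % assumed File Exchange signature
% The dip test's null hypothesis is unimodality, so p_value < 0.05
% rejects unimodality and suggests more than one mode.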

Best Answer

Three points:

  1. You are dealing with floating-point values, so the comparison fit1.AIC == fit2.AIC is very unlikely to ever be exactly true. Look up the concept of machine epsilon. You probably want something along the lines of abs(fit1.AIC - fit2.AIC) < 2*eps to test for equality.
  2. An AIC difference of 2 or less is usually a practical tie, because in that case the relative Akaike weights are quite inconclusive; AIC should not be interpreted in absolute terms (see the weight calculation sketched after this list). Some useful informal references: [1,2,3], and two formal ones: Burnham and Anderson (2002) and Posada and Buckley (2004).
  3. I suspect that the fitted mixture has two Gaussians with $\mu$'s that are relatively close together but standard deviations $\sigma$'s that are quite different, so the second component mostly accounts for outliers in your sample rather than a genuine second mode (a toy example of this effect is sketched below). Without the actual data, or at least the outputs of fit1 and fit2, one can only guess.
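To make point 2 concrete, here is a minimal sketch of how the Akaike weights could be computed from your two AIC values (the variable names are illustrative; the formula $w_i = \exp(-\Delta_i/2) / \sum_j \exp(-\Delta_j/2)$ is the standard one from Burnham and Anderson):

AICs  = [fit1.AIC, fit2.AIC];                 % AIC of the 1- and 2-component fits
delta = AICs - min(AICs);                     % AIC differences relative to the best model
w     = exp(-delta/2) / sum(exp(-delta/2));   % Akaike weights (relative support for each model)
% Weights like [0.45 0.55] mean the evidence for bimodality is weak;
% something like [0.02 0.98] would give reasonably strong support.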
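To illustrate point 3, here is a toy sketch (it assumes your data might look like a heavy-tailed but unimodal sample, and it uses fitgmdist from the Statistics Toolbox): a 2-component fit will often win on AIC even when there is only one mode, because the wide second component soaks up the tails.

rng(1);                                   % reproducibility
x  = trnd(3, 1000, 1);                    % unimodal but heavy-tailed sample (Student's t, 3 dof)
g1 = fitgmdist(x, 1);                     % 1-component fit
g2 = fitgmdist(x, 2, 'Replicates', 5);    % 2-component fit, several restarts
[g1.AIC, g2.AIC]                          % the 2-component AIC is typically lower here
g2.mu                                     % the two means are usually close together...
squeeze(g2.Sigma)                         % ...while the two variances differ a lot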