When deciding on a distribution, the science is more important than the tests. Think about what led to the data, and what values are possible, likely, and meaningful. The formal tests can find obvious differences, but often cannot rule out distributions that are similar (and note that the chi-squared distribution is a special case of the gamma distribution). Look at this quick simulation (and try it with other values):
> mean(replicate(1000, ks.test( rt(5000, df=20), pnorm )$p.value)<0.05)
[1] 0.111
The ks.test function can only find the difference between a t-distribution with 20 df and a standard normal 11% of the time, even with a sample size of 5000.
If you really want to test the distributions, then I would suggest using the vis.test function in the TeachingDemos package. Instead of rigid tests of exact fit, it presents a plot of the original data mixed in with similar plots from the candidate distribution and asks you (or another viewer) to pick out the plot of the original data. If you cannot distinguish visually between your data and the simulated data, then the candidate distribution is probably a reasonable starting point (but this does not rule out other possible distributions; think about which ones make the most sense scientifically).
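The idea can be sketched roughly as below. This is a hedged sketch, not a definitive recipe: it assumes the TeachingDemos package is installed, and that vt.qqnorm (one of the plotting helpers shipped with that package) is the plot function you want; it must be run interactively, since the whole point is that a viewer tries to pick out the real data.

```r
# Visual test (interactive; assumes the TeachingDemos package is installed):
# the data's QQ-plot is hidden in a grid of QQ-plots of simulated normal
# data, and the viewer tries to click on the one showing the real data.
# library(TeachingDemos)
# set.seed(1)
# x <- rt(200, df = 20)       # example data; substitute your own
# vis.test(x, vt.qqnorm)      # click the plot you believe is the real data
```

If the viewer does no better than chance at picking out the real plot, the candidate distribution is visually indistinguishable from the data.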
Another approach would be to generate your new data from a density estimate of your original data. The logspline package for R has functions to estimate the density and then generate random data from that estimate. Alternatively, generating data from a kernel density estimate means selecting a point from your data, then generating a random value from the kernel centered around that point. This can be as simple as selecting a random sample from the data with replacement, then adding small normal deviates to the values.
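Both approaches can be sketched in a few lines. This is a minimal sketch with simulated example data; the logspline part assumes that package is installed (it is on CRAN), while the smoothed-bootstrap part uses only base R.

```r
# Two ways to generate new data that mimic an original sample x.
# (x is simulated here purely for illustration.)
set.seed(1)
x <- rgamma(200, shape = 2)

# 1. Density estimate via the logspline package (assumed installed):
# library(logspline)
# fit  <- logspline(x)
# new1 <- rlogspline(1000, fit)

# 2. Smoothed bootstrap from a kernel density estimate: resample with
#    replacement, then add normal noise with sd equal to a bandwidth.
h    <- bw.nrd0(x)                             # R's default bandwidth rule
new2 <- sample(x, 1000, replace = TRUE) + rnorm(1000, 0, h)
```

The second method is exactly the "resample plus small normal deviates" idea above: it draws from the Gaussian kernel density estimate of the data.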
Mixture modelling in my experience can be tricky. Getting a good fit from a mixture model can be much more difficult than realising that a mixture may be a good approach.
I have to say that I don't find the fit convincing here. Your graphics alone do not give much support to your summary of "decent".
Kernel density estimates do give some support to the idea of two modes. (That is, using the default choice of your software, as the second mode disappears readily if you smooth enough.) The first mode is at about log(x) = 6. The second is at about log(x) = 12, but is much weaker. The density at the first mode is higher by a factor of about 4 or 5, again at or near default choices of kernel type and width. On your graph, the kernel density is the dashed line, but as you show it, it has been truncated: the density rises to more than 0.25 if smoothed similarly to what you did. (I used different software, but that should be immaterial.)
In fact a complete curve can be seen at What distribution does my data follow?
In contrast the mixture model yields two modes with more nearly equal density and the position of the second fitted mode does not correspond well to the observed secondary mode, being higher at about log(x) = 14. Furthermore, in each case the density of the other distribution is negligible at the position of the mode, so the ratio of densities at the two modes would be about the same in the combined distribution. (It isn't expected that modes of fitted components correspond exactly to modes in the data, but the mismatch here is disappointing.)
The histogram here is a mystery. With this number of data (1567 observations) you could afford more bins than 9. But what is the histogram showing? If it's showing the combined fitted mixture model, that should be a smooth curve; if it's showing the original data, it's not doing a good job.
That said, what is currently holding you up? The error message from the Kolmogorov-Smirnov test function indicates that it objects to ties, and as you do have ties in your data, that seems right. But, as I have discussed, the graphics alone tell you that the fit is not good. Wanting a P-value too would just be icing an unwelcome cake.
A more convincing fit therefore might require the distribution with the higher mode to be a much smaller fraction of the total, but I don't have good ideas on how to move in that direction.
Alternatively, many real distributions don't approximate theoretical distributions (or even mixtures of them) at all well. It's always welcome when that happens, but it can be fine just to present smoothed density estimates and say "This is what we have". Even in cases where we think we have two distinct sub-populations (say height or weight for males or females), it can be really hard to see that from a combined distribution.
(Incidentally, for the sake of many readers, please do not use both red and green curves in a graph.)
Best Answer
You can use a model selection tool such as AIC or BIC to compare the models. However, this does not tell you about the goodness of fit; the same applies to the likelihood ratio.
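As a concrete sketch of the AIC comparison (the data here are simulated, and MASS::fitdistr is just one convenient way to fit candidate distributions by maximum likelihood; substitute your own data and candidates):

```r
# Comparing candidate distributions by AIC via MASS::fitdistr.
# fitdistr objects carry a logLik, so AIC() works on them directly.
library(MASS)
set.seed(1)
x <- rgamma(300, shape = 2, rate = 1)          # example data
aics <- AIC(fitdistr(x, "gamma"),
            fitdistr(x, "lognormal"))
aics
```

The lower AIC identifies the better-supported of the two candidates, but, as noted above, it says nothing about whether either model fits well in an absolute sense.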
A formal goodness of fit test can be conducted by using the chi-square goodness of fit test. This is very sensitive to the choice of the bin-width, though.
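A hedged sketch of the chi-square test for a fitted normal is below (simulated data; the bin count k is an assumption you should vary to see the sensitivity mentioned above). Note that with parameters estimated from the data, the degrees of freedom should be reduced accordingly, which chisq.test cannot do, so the statistic is computed by hand.

```r
# Chi-square goodness of fit for a fitted normal, with equal-count bins.
# Two parameters were estimated, so df = k - 1 - 2.
set.seed(1)
x  <- rnorm(500, 10, 3)                        # example data
k  <- 10                                       # bin count: vary this!
br <- quantile(x, probs = seq(0, 1, length.out = k + 1))
br[1] <- -Inf; br[k + 1] <- Inf                # cover the whole support
obs  <- table(cut(x, br))                      # observed counts per bin
p    <- diff(pnorm(br, mean(x), sd(x)))        # fitted bin probabilities
stat <- sum((obs - length(x) * p)^2 / (length(x) * p))
pval <- pchisq(stat, df = k - 1 - 2, lower.tail = FALSE)
pval
```

Re-running with, say, k = 5 or k = 20 will usually move the P-value noticeably, which is exactly the bin-width sensitivity being warned about.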
A less formal, and more visual, goodness-of-fit check is the QQ-envelope, which is obtained as follows:
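One common construction, sketched here under the assumption of a fitted normal (replace rnorm/qnorm with your fitted model), is to simulate many samples from the fitted distribution and take pointwise quantiles of the sorted simulated values:

```r
# QQ-envelope sketch: simulate 999 samples from the fitted model, sort
# each, and take pointwise 2.5% / 97.5% quantiles of the order statistics.
set.seed(1)
x  <- rnorm(100, 5, 2)                         # example data
n  <- length(x)
mu <- mean(x); s <- sd(x)                      # fitted parameters
sims <- replicate(999, sort(rnorm(n, mu, s)))  # n x 999 order statistics
env  <- apply(sims, 1, quantile, probs = c(0.025, 0.975))
th   <- qnorm(ppoints(n), mu, s)               # fitted theoretical quantiles
plot(th, sort(x), xlab = "Fitted quantiles", ylab = "Ordered data")
lines(th, env[1, ], lty = 2)                   # lower envelope
lines(th, env[2, ], lty = 2)                   # upper envelope
```

Points falling outside the dashed envelope mark the regions of the distribution where the fitted model is inconsistent with the data.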
You can use this tool to identify areas where the model produces a poor fit.