Goodness of Fit – Testing Goodness of Fit for a Mixture in R

goodness of fitmixture-distributionnormal distributionr

I just estimated the parameters for a mixture of two gaussians with different means and different sigmas, I would like to test if the data adjusts well to the explicit form of the mixture, do I necessarily need to simulate data from the mixture or how can I test the goodness of fit? I am using the mixtools package.

Best Answer

You can write a function that calculates the relevant value for a given test under the null hypothesis based on the outputs from, say, normalmixEM, then do the test using that. For example, for the Kolmogorov-Smirnov test, we need the CDF given a set of parameters:

# CDF of mixture of two normals
pmnorm <- function(x, mu, sigma, pmix) {
  pmix[1]*pnorm(x,mu[1],sigma[1]) + (1-pmix[1])*pnorm(x,mu[2],sigma[2])
}

We then run the K-S test in the usual way:

# Sample run
x <- c(rnorm(50), rnorm(50,2))

foo <- normalmixEM(x)
test <- ks.test(x, pmnorm, mu=foo$mu, sigma=foo$sigma, pmix=foo$lambda)
test

    One-sample Kolmogorov-Smirnov test

data:  x 
D = 0.0559, p-value = 0.914
alternative hypothesis: two-sided

Always keeping in mind the fact that we estimated the parameters from the same data we are using to do the test, thus biasing the test towards failure to reject H0.

We can overcome that latter bias to some extent via a parametric bootstrap - generating many samples from a mixture of normals parameterized by the estimates from normalmixEM, then estimating the parameters of the samples and calculating the test statistics for each sample using the estimated parameters. Under this construction, the null hypothesis is always true. In the code below, I'm helping the EM algorithm out by starting at the true parameters for the sample, which is also cheating a little, as it makes the EM algorithm more likely to find values near the true values than with the original sample, but greatly reduces the number of error messages.

# Bootstrap estimation of ks statistic distribution
N <- length(x)
ks.boot <- rep(0,1000)
for (i in 1:1000) {
  z <- rbinom(N, 1, foo$lambda[1])
  x.b <- z*rnorm(N, foo$mu[1], foo$sigma[1]) + (1-z)*rnorm(N, foo$mu[2], foo$sigma[2])
  foo.b <- normalmixEM(x.b, maxit=10000, lambda=foo$lambda, mu=foo$mu, sigma=foo$sigma)
  ks.boot[i] <- ks.test(x.b, pmnorm, mu=foo.b$mu, sigma=foo.b$sigma, pmix=foo.b$lambda)$statistic
}

mean(test$statistic <= ks.boot)
[1] 0.323

So instead of a p-value of 0.914, we get a p-value of 0.323. Interesting, but not particularly important in this case.

Related Solutions

Solved – Plot explicit cdf instead of ecdf in R

Like the title of the function ecdf() says, it is empirical and only runs on samples.

If you want the exact cdf of a Gaussian, the function you are looking for is pnorm(). Here is a demonstration.

x <- seq(from=-5, to=5, by=.1)
y <- pnorm(x)
plot(x, y, type='l')

If you replace dnorm() by pnorm() in your code, and x by the range of values you want to take the cdf over you should get the result you are looking for.

Solved – Fitting mixture distributions and computing goodness-of-fit

Mixture modelling in my experience can be tricky. Getting a good fit from a mixture model can be much more difficult than realising that a mixture may be a good approach.

I have to say that I don't find the fit convincing here. Your graphics alone do not give much support to your summary of "decent".

Kernel density estimates do give some support to the idea of two modes. (That is, using the default choice of your software, as the second mode disappears readily if you smooth enough.) The first mode is about log(x) = 6. The second is about log(x) = 12, but is however much weaker. The density at the first mode is higher by a factor of about 4 or 5, again at or near default choices of kernel type and width. On your graph, the kernel density is the dashed line, but as you show it it has been truncated: the density rises to more than 0.25 if smoothed similarly to what you did. (I used different software, but that should be immaterial.)

In fact a complete curve can be seen at What distribution does my data follow?

In contrast the mixture model yields two modes with more nearly equal density and the position of the second fitted mode does not correspond well to the observed secondary mode, being higher at about log(x) = 14. Furthermore, in each case the density of the other distribution is negligible at the position of the mode, so the ratio of densities at the two modes would be about the same in the combined distribution. (It isn't expected that modes of fitted components correspond exactly to modes in the data, but the mismatch here is disappointing.)

The histogram here is a mystery. With this number of data (1567 observations) you could afford more bins than 9. But what is the histogram showing? If it's showing the combined fitted mixture model, that should be a smooth curve; if it's showing the original data, it's not doing a good job.

That said, what is currently holding you up? The error message from the Kolmogorov-Smirnov test function indicates that it objects to ties, and as you do have ties in your data, that seems right. But, as I have discussed, the graphics alone tell you that the fit is not good. Wanting a P-value too would just be icing an unwelcome cake.

A more convincing fit therefore might require the distribution with the higher mode to be a much smaller fraction of the total, but I don't have good ideas on how to move in that direction.

Alternatively, many real distributions don't approximate theoretical distributions (or even mixtures of them) at all well. It's always welcome when that happens, but it can be fine just to present smoothed density estimates and say "This is what we have". Even in cases where we think we have two distinct sub-populations (say height or weight for males or females), it can be really hard to see that from a combined distribution.

(Incidentally, for the sake of many readers, please do not use both red and green curves in a graph.)

Best Answer

Related Solutions

Solved – Plot explicit cdf instead of ecdf in R

Solved – Fitting mixture distributions and computing goodness-of-fit

Related Question