Goodness of Fit – Testing Goodness of Fit for a Mixture in R

Tags: goodness-of-fit, mixture-distribution, normal-distribution, r

I have just estimated the parameters of a mixture of two Gaussians with different means and different sigmas, and I would like to test whether the data fit the explicit form of the mixture well. Do I necessarily need to simulate data from the mixture, or is there another way to test the goodness of fit? I am using the mixtools package.

Best Answer

You can write a function that calculates the relevant value for a given test under the null hypothesis based on the outputs from, say, normalmixEM, then do the test using that. For example, for the Kolmogorov-Smirnov test, we need the CDF given a set of parameters:

# CDF of mixture of two normals
pmnorm <- function(x, mu, sigma, pmix) {
  pmix[1]*pnorm(x,mu[1],sigma[1]) + (1-pmix[1])*pnorm(x,mu[2],sigma[2])
}
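As a quick sanity check (the parameter values below are arbitrary, chosen only for illustration), we can verify that `pmnorm` behaves like a proper CDF: it should be nondecreasing and run from 0 to 1 across a wide grid:

```r
# CDF of mixture of two normals (as defined above)
pmnorm <- function(x, mu, sigma, pmix) {
  pmix[1]*pnorm(x, mu[1], sigma[1]) + (1-pmix[1])*pnorm(x, mu[2], sigma[2])
}

# Arbitrary illustrative parameters
mu <- c(0, 2); sigma <- c(1, 1.5); pmix <- c(0.4, 0.6)
grid <- seq(-10, 12, length.out = 500)
vals <- pmnorm(grid, mu, sigma, pmix)

all(diff(vals) >= 0)   # TRUE: nondecreasing
round(range(vals), 3)  # 0 and 1 at the endpoints
```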

We then run the K-S test in the usual way:

# Sample run
library(mixtools)

x <- c(rnorm(50), rnorm(50, 2))

foo <- normalmixEM(x)
test <- ks.test(x, pmnorm, mu=foo$mu, sigma=foo$sigma, pmix=foo$lambda)
test

    One-sample Kolmogorov-Smirnov test

data:  x 
D = 0.0559, p-value = 0.914
alternative hypothesis: two-sided 

Keep in mind that we estimated the parameters from the same data we are using to run the test, which biases the test toward failure to reject H0.
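This bias is easy to see even in the simpler one-component case (this demonstration is mine, not part of the answer above): repeatedly draw from a single standard normal, estimate the mean and sd from each sample, and feed those estimates to `ks.test`. Under H0 the p-values should be uniform on [0, 1], but with estimated parameters they concentrate near 1, and the nominal 5% test almost never rejects (the Lilliefors phenomenon):

```r
# Plugging parameters estimated from the same data into the K-S test
# biases it toward non-rejection.
set.seed(1)
pvals <- replicate(500, {
  y <- rnorm(100)                             # data truly N(0, 1)
  ks.test(y, pnorm, mean(y), sd(y))$p.value   # estimated mean and sd
})

mean(pvals)         # well above the 0.5 expected from a calibrated test
mean(pvals < 0.05)  # far below the nominal 5% rejection rate
```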

We can overcome this bias to some extent via a parametric bootstrap: generate many samples from a mixture of normals parameterized by the estimates from normalmixEM, then estimate the parameters for each sample and calculate the test statistic using those estimates. Under this construction, the null hypothesis is always true. In the code below, I help the EM algorithm along by starting it at the true parameters for each sample. This is also cheating a little, as it makes the EM algorithm more likely to find values near the true values than it would be with the original sample, but it greatly reduces the number of error messages.

# Bootstrap estimation of ks statistic distribution
N <- length(x)
ks.boot <- rep(0,1000)
for (i in 1:1000) {
  z <- rbinom(N, 1, foo$lambda[1])
  x.b <- z*rnorm(N, foo$mu[1], foo$sigma[1]) + (1-z)*rnorm(N, foo$mu[2], foo$sigma[2])
  foo.b <- normalmixEM(x.b, maxit=10000, lambda=foo$lambda, mu=foo$mu, sigma=foo$sigma)
  ks.boot[i] <- ks.test(x.b, pmnorm, mu=foo.b$mu, sigma=foo.b$sigma, pmix=foo.b$lambda)$statistic
}

mean(test$statistic <= ks.boot)
[1] 0.323

So instead of a p-value of 0.914, we get a p-value of 0.323. Interesting, but not particularly important in this case.