Solved – the best way to perform a goodness-of-fit test of data to a continuous distribution with the chi-square method in R

goodness-of-fit, r

I'm trying to write a piece of code in R that finds the best-fitting distribution for a set of data by

  • performing goodness-of-fit tests against a list of candidate distributions, and then
  • selecting the one that fits best.

The program should be able to run in real time, so the analysis must be computationally light. By this I mean it should be able to process a fit every second, or every few seconds at most, so the simpler the program, the better.

For instance, I've already written the following code:

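# Body of the fitting function. Assumes library(MASS) for fitdistr(),
# library(vcd) for goodfit(), tidy() from broom (an assumption), and that
# `results` already exists as a numfit x 5 matrix.
for (i in 1:numfit) {
  if (distrib[[i]] == "negative binomial") {
    gf_shape <- "negative binomial"
    # Try explicit starting values first; fall back to fitdistr()'s defaults.
    fd_nb <- tryCatch(
      fitdistr(data, "negative binomial", start = list(size = 1, prob = 0.5)),
      error = function(e) fitdistr(data, "negative binomial")
    )
    est_size <- fd_nb$estimate[["size"]]
    # The default fit is parameterized by (size, mu), so convert mu to prob.
    est_prob <- if ("prob" %in% names(fd_nb$estimate)) {
      fd_nb$estimate[["prob"]]
    } else {
      est_size / (est_size + fd_nb$estimate[["mu"]])
    }
    gfn <- goodfit(data, type = "nbinomial", method = "MinChisq", par = list(size = est_size))
    tidied <- tidy(summary(gfn))
    # Report both negative binomial parameters.
    results[i, ] <- c(gf_shape, est_size, est_prob, tidied$X.2, tidied$P...X.2.)
  } else if (distrib[[i]] == "poisson") {
    gf_shape <- "poisson"
    fd_p <- fitdistr(data, "poisson")
    est_lambda <- fd_p$estimate[[1]]
    gf <- goodfit(data, type = "poisson", method = "MinChisq", par = list(lambda = est_lambda))
    tidied <- tidy(summary(gf))
    results[i, ] <- c(gf_shape, est_lambda, "NA", tidied[1, 1], tidied[1, 3])
  }
}
# Add the header row once, after every candidate has been fitted.
results <- rbind(c("distribution", "parameter 1", "parameter 2", "chi-squared test statistic", "P > X2"), results)
return(results)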

that performs a chi-square goodness-of-fit test of my data against the Poisson and negative binomial distributions. From the results I can then pick the distribution with the lowest chi-squared test statistic and take it as the most suitable one for the data.

My question is how to do this with continuous data and continuous distributions. Using the goodfit function (from the vcd package) in R worked really well for me, but it only works with discrete distributions. I do, however, need to use the chi-square goodness-of-fit method (project requirement), so I'm not sure which package or method to turn to or how to implement this.

Can anyone help me with an easy idea of how to implement something similar for, say, a normal distribution? At the moment I think I'm just going to write a piece of code to bin the data into categories, the way you would do the chi-square test by hand, but any way to optimize this process would help. Help would be much appreciated.

Best Answer

It's a pity that a chi-square goodness-of-fit test is a "project requirement," because in general there is much to be lost by binning continuous variables. If possible, try to convince those in charge to allow methods more appropriate for continuous variables in this context of distribution fitting.

If you are stuck with binning, you might have a problem. Essentially, if you define the bin boundaries based on parameters of a distribution estimated from the data, then the chi-square test itself may no longer be valid. This goes beyond the simple loss of degrees of freedom in chi-square due to estimating parameters; the chi-square statistic as usually calculated may no longer follow a chi-square distribution. See the "Level 3" section of the answer by @cardinal on that linked page; that answer includes some recommendations for how to proceed (which are beyond my particular expertise). As you proceed, you do have to consider what you are trying to accomplish by fitting different types of distributions to the same data, and recognize that results might be hard to generalize even if you find a reliable way to define useful bin boundaries.
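If binning is unavoidable, the hand calculation mentioned in the question can be sketched roughly as follows. This is a minimal illustration in base R, not a validated procedure: the helper name chisq_gof_normal, the default of k = 10 bins, and the choice of equiprobable bins under the fitted normal are all assumptions made for the example. Because the bin boundaries come from parameters estimated on the same data, the reported p-value (using the conventional k - 1 - 2 degrees of freedom) is only approximate, for the reason described above.

```r
# Hypothetical helper: binned chi-square goodness-of-fit test of x against
# a normal distribution whose mean and sd are estimated from x itself.
chisq_gof_normal <- function(x, k = 10) {
  stopifnot(is.numeric(x), all(is.finite(x)), k > 3)
  n <- length(x)
  mu <- mean(x)
  sigma <- sd(x)

  # k + 1 boundaries (from -Inf to Inf) giving k bins of equal probability
  # under the fitted normal.
  breaks <- qnorm(seq(0, 1, length.out = k + 1), mean = mu, sd = sigma)

  # Observed counts per bin; empty bins are kept as zeros.
  observed <- as.vector(table(cut(x, breaks = breaks)))
  # Equiprobable bins: every bin expects n / k observations.
  expected <- rep(n / k, k)

  statistic <- sum((observed - expected)^2 / expected)
  # Conventional df: k bins, minus 1, minus 2 estimated parameters.
  df <- k - 1 - 2
  p_value <- pchisq(statistic, df = df, lower.tail = FALSE)

  list(statistic = statistic, df = df, p.value = p_value,
       observed = observed, expected = expected)
}

# Example call on simulated data.
set.seed(1)
x <- rnorm(500, mean = 10, sd = 2)
res <- chisq_gof_normal(x, k = 10)
print(res$statistic)
print(res$p.value)
```

Since every bin has the same expected count n / k, k should be chosen so that n / k stays comfortably above the usual rule-of-thumb minimum of about 5.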
