Hypothesis Testing – Best Approach for Goodness of Fit for Discrete Data

goodness of fit, hypothesis testing

The data:
For the purposes of this question/communication we can assume the data looks like rnbinom(1000,size=0.1,prob=0.01) in R, which generates a random sample of 1,000 observations from a negative binomial distribution (with size=0.1 and probability of success prob=0.01). This is the parametrization where the random variable represents the number of failures before size number of successes. The tail is long, and 1,000 observations is not a lot of data.

The problem:
I have been given some data (integers on {1,2,…}) [see above] (1,500 data points) and asked to find the "best fit" distribution and estimates of any parameters. I know nothing else about the data. I'm aware this is not a very large sample for data with a long tail. More data is a possibility.

What I've done:
I have considered using a likelihood ratio test by fitting two different distributions to the data, but I don't think this applies (as in, I cannot determine appropriate critical values) unless the two distributions are nested…

I then considered using a Kolmogorov-Smirnov test (adjusted for discrete data) but, in R anyway, it complained it could not compute a p-value for "data with ties".

What is the best way for me to go about testing/determining the fit of different distributions in this context?
Here are some other things I have considered:

  1. Ask for (lots) more data. But will this help? Will I be able to use asymptotic results, for instance?
  2. Consider some bootstrap/re-sampling/Monte Carlo scheme? If so, is there a standard reference I can/should read to learn how to do this correctly? Thanks

Best Answer

If I understood your question correctly, you just need to fit a distribution to the data. In this case, you could use one of the fitting functions in R packages, such as fitdistr from the MASS package, which uses maximum likelihood estimation (MLE) and supports discrete distributions, including Poisson, geometric and negative binomial.
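As a minimal sketch of this fitting step, the snippet below simulates data as in the question and fits two candidate distributions with fitdistr. The seed and the object names (x, fit_nb, fit_pois) are illustrative assumptions, not part of the original question.

```r
library(MASS)  # provides fitdistr()

set.seed(1)  # arbitrary seed, for reproducibility only
x <- rnbinom(1000, size = 0.1, prob = 0.01)  # simulated stand-in for the real data

# Maximum likelihood fits of two candidate discrete distributions
fit_nb   <- fitdistr(x, "negative binomial")  # estimates size and mu
fit_pois <- fitdistr(x, "Poisson")            # estimates lambda

fit_nb$estimate    # parameter estimates (size, mu)
fit_nb$sd          # approximate standard errors
fit_pois$estimate

# Maximized log-likelihoods of the two fits
logLik(fit_nb)
logLik(fit_pois)
```

Note that fitdistr reports the negative binomial in the (size, mu) parametrization; prob can be recovered as size / (size + mu).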

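Then, as a second step, you would need to perform one (or more) goodness-of-fit (GoF) tests to validate the results. The Kolmogorov-Smirnov, Anderson-Darling and (AFAIK) Lilliefors tests, in their standard forms, assume a continuous distribution and so are not applicable to discrete data. Fortunately, the chi-square GoF test is applicable to both continuous and discrete distributions, and in R it is a matter of calling the stats::chisq.test() function. Keep in mind that chisq.test() does not know that parameters were estimated from the data, so the degrees of freedom need to be reduced by hand, and that sparse cells in a long tail should be pooled so the expected counts are not too small.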
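A rough sketch of the chi-square GoF step for a fitted negative binomial might look like the following. The bin boundaries in upper are an arbitrary illustrative choice, and the degrees-of-freedom correction (cells - 1 - number of estimated parameters) is only approximate when the parameters are estimated from the ungrouped data.

```r
library(MASS)

set.seed(1)  # arbitrary seed, for reproducibility only
x <- rnbinom(1000, size = 0.1, prob = 0.01)  # simulated stand-in for the real data

fit      <- fitdistr(x, "negative binomial")
size_hat <- fit$estimate[["size"]]
mu_hat   <- fit$estimate[["mu"]]

# Pool the long tail into bins (illustrative upper bounds, last bin is open-ended)
upper    <- c(0, 1, 2, 5, 10, 20, 50, 100)
observed <- table(cut(x, breaks = c(-Inf, upper, Inf)))

# Expected bin probabilities under the fitted model
cdf        <- c(0, pnbinom(upper, size = size_hat, mu = mu_hat), 1)
p_expected <- diff(cdf)

test <- chisq.test(observed, p = p_expected)

# chisq.test() uses df = cells - 1; subtract 2 more for the estimated parameters
df_adj <- length(observed) - 1 - 2
p_adj  <- pchisq(test$statistic, df = df_adj, lower.tail = FALSE)

test$expected  # check that no expected count is very small
p_adj          # approximate p-value with adjusted degrees of freedom
```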

Alternatively, as your data represents a discrete distribution, you can use the vcd package and its function goodfit(). This function can be used either as a replacement for the standard GoF test chisq.test(), or, even better, as a full workflow (distribution fitting and GoF testing). For the full workflow option, just use the default setup and do not specify the parameters via par (you can specify size if type = "nbinomial"). The parameters will then be estimated using maximum likelihood or minimum chi-square (you can select the method). Results can be obtained by calling the summary() function.
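The full workflow with goodfit() could be sketched as follows, assuming the vcd package is installed; the simulated x again stands in for the real data, and the seed is arbitrary.

```r
library(vcd)  # provides goodfit()

set.seed(1)  # arbitrary seed, for reproducibility only
x <- rnbinom(1000, size = 0.1, prob = 0.01)  # simulated stand-in for the real data

# Fit a negative binomial and compute fitted frequencies in one call;
# par is left unspecified so that the parameters are estimated (here by ML)
gf <- goodfit(x, type = "nbinomial", method = "ML")

gf$par       # estimated size and prob
summary(gf)  # goodness-of-fit test for the fitted distribution
```

Setting method = "MinChisq" switches to minimum chi-square estimation, and type = "poisson" fits a Poisson distribution for comparison.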