Goodness of Fit – Evaluating Distribution Fit with Estimated Parameters Using KS Tests in R

distributions, estimation, goodness-of-fit, kolmogorov-smirnov-test, r

I am currently trying to find a well-known distribution that best fits my positively skewed dataset (n = 70). As a first step, I used the fitdistrplus R package to estimate parameters for the Gamma, Weibull, Lognormal and Exponential distributions via maximum likelihood estimation, though I am unsure whether MLE is the best choice with only 70 observations (is there a better one?).

In the second step, I selected the model with the smallest AIC. But of course the chosen model should also pass a goodness-of-fit test. My first idea was simply to run a Kolmogorov–Smirnov test with the estimated parameters, but this does not seem to be a good idea, since a KS test whose reference parameters were estimated from the same data yields more or less useless (inflated) p-values.
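A quick simulation illustrates the problem (a minimal sketch; the exact numbers depend on the seed): under the null hypothesis, p-values should be roughly uniform on [0, 1], but when the reference parameters are estimated from the same data, the naive KS p-values pile up near 1.

```r
# Demonstration: naive KS p-values with parameters estimated from the
# same data are not uniform under the null; they cluster near 1, so the
# test almost never rejects and its nominal p-values are meaningless.
set.seed(1)
pvals <- replicate(500, {
  x <- rlnorm(70, meanlog = 0, sdlog = 1)  # data truly lognormal, n = 70
  m <- mean(log(x))                        # parameters estimated from x
  s <- sd(log(x))
  ks.test(x, "plnorm", meanlog = m, sdlog = s)$p.value
})
mean(pvals)        # typically well above the ~0.5 expected for uniform p-values
mean(pvals > 0.5)  # far more than half the p-values exceed 0.5
```
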

During my search on the web, I stumbled over Greg Snow's suggestion and, apart from that, over this page, which describes an interesting Monte Carlo approach (from Clauset et al.). An adapted R code sample that uses the fitdistrplus package for maximum likelihood estimation of the log-normal distribution looks as follows:

library(fitdistrplus)

lognormal <- function(d, limit = 2500) {
  # MLE for the lognormal distribution
  fit <- fitdist(d, "lnorm", method = "mle")

  # KS statistic for the original data against the fitted distribution
  t <- ks.test(d, "plnorm", meanlog = fit$estimate["meanlog"],
               sdlog = fit$estimate["sdlog"])

  # parametric bootstrap p-value: simulate from the fitted model,
  # re-estimate the parameters on each synthetic sample, and count how
  # often the synthetic KS statistic exceeds the observed one
  count <- 0
  for (i in 1:limit) {
    syn <- rlnorm(length(d), meanlog = fit$estimate["meanlog"],
                  sdlog = fit$estimate["sdlog"])
    fit2 <- fitdist(syn, "lnorm", method = "mle")
    t2 <- ks.test(syn, "plnorm", meanlog = fit2$estimate["meanlog"],
                  sdlog = fit2$estimate["sdlog"])
    if (t2$statistic >= t$statistic) count <- count + 1
  }

  list(meanlog = fit$estimate["meanlog"], sdlog = fit$estimate["sdlog"],
       stat = t$statistic, p = count / limit, KSp = t$p.value)
}

What I am currently asking myself (and you) is: does this approach make sense given the small sample size (or should I use moment-based or other estimators, or is MLE fine), and is this a suitable way to test the goodness of fit?

Best Answer

The paper by Clauset et al. warns (Section 4.2) against small sample sizes (n < 100), for which almost any candidate distribution is easy to fit. You may want to consider directly comparing the candidate models instead.

While the p-value of the KS statistic with estimated parameters is an overestimate, the parametric bootstrap procedure you described tackles this and yields a correct p-value, given enough simulations.

However, the goodness of fit is not computed correctly in your code: it does not strictly follow the procedure described in the paper and implemented in the poweRlaw package.

Specifically, the synthetic data generation procedure is only half implemented; it does not search for the best xmin, as the estimate_xmin function of the poweRlaw package does; and finally, ks.test discards all the ties, which the package's built-in KS test does not.
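For reference, a minimal sketch of that route using poweRlaw for a continuous log-normal (the data are a placeholder; no_of_sims is kept small here for speed, while the paper recommends far more simulations):

```r
library(poweRlaw)

set.seed(7)
x <- rlnorm(70, meanlog = 1, sdlog = 0.8)  # placeholder for your data

m <- conlnorm$new(x)          # continuous log-normal model
m$setXmin(estimate_xmin(m))   # search for the best xmin (and parameters)

# parametric bootstrap goodness-of-fit test, as in Clauset et al.
bs <- bootstrap_p(m, no_of_sims = 100, threads = 1)
bs$p  # bootstrapped p-value; small values indicate a poor fit
```
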

Code that takes these issues into account using poweRlaw is provided on this page; as a consequence, it is significantly slower than the code you suggested: http://notesnico.blogspot.com/2014/07/goodness-of-fit-test-for-log-normal-and.html
