Solved – use Kolmogorov-Smirnov test and estimate distribution parameters

Tags: estimation, fitting, kolmogorov-smirnov test

I've read that the Kolmogorov-Smirnov test should not be used to test the goodness of fit of a distribution whose parameters have been estimated from the same sample.

Does it make sense to split my sample in two and use the first half for parameter estimation and the second half for the KS test?

Thanks in advance

Best Answer

The better approach is to compute your critical value or p-value by simulation. The problem is that when you estimate the parameters from the data rather than using hypothesized values, the distribution of the KS statistic no longer follows its usual null distribution.
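To see the problem directly, here is a small sketch of my own (not part of the original answer), assuming the data really are normal: the naive ks.test with parameters estimated from the same data rejects far less often than the nominal 5%.

pvals <- replicate(2000, {
  x <- rnorm(30)                                        # data really are N(0,1)
  ks.test(x, 'pnorm', mean=mean(x), sd=sd(x))$p.value   # parameters estimated from x
})
mean(pvals < 0.05)  # typically far below the nominal 0.05
hist(pvals)         # bunched up near 1 instead of uniform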

You can ignore the p-values from the KS test and instead simulate a bunch of datasets of the same size as your real data from the candidate distribution (with a meaningful set of parameters). Then for each simulated set, estimate the parameters and compute the KS statistic using those estimated parameters. Your p-value is the proportion of test statistics from the simulated sets that are more extreme than the one for your original data.

Added Example

Here is an example using R (hopefully readable/understandable for people who use other programs).

A simple example using the Normal distribution as the null hypothesis:

tmpfun <- function(x, m=0, s=1, sim=TRUE) {
  # Either simulate a new dataset of the same size as x from N(m, s),
  # or use the actual data x itself (sim=FALSE)
  if(sim) {
    tmp.x <- rnorm(length(x), m, s)
  } else {
    tmp.x <- x
  }
  # Estimate the parameters from the (simulated or real) data, then compute
  # the KS statistic against a normal with those estimated parameters
  obs.mean <- mean(tmp.x)
  obs.sd <- sd(tmp.x)
  ks.test(tmp.x, 'pnorm', mean=obs.mean, sd=obs.sd)$statistic
}

set.seed(20200319)
x <- rnorm(25, 100, 5)             # the "real" data

out <- replicate(1000, tmpfun(x))  # null distribution of the KS statistic

hist(out)
abline(v=tmpfun(x, sim=FALSE))     # KS statistic for the actual data
mean(out >= tmpfun(x, sim=FALSE))  # simulation-based p-value

The function will either compute the KS test statistic from the actual data (sim=FALSE) or simulate a new dataset of the same size from a normal distribution with the specified mean and sd. In either case it then computes the test statistic against a normal distribution with the same mean and sd as the sample (original or simulated).

The code then runs 1,000 simulations (feel free to change this and rerun) to approximate the distribution of the test statistic under the null (but with estimated parameters), and finally compares the test statistic for the original data to this null distribution.

We can simulate the whole process (simulations within simulations) to see how it compares to the default p-values:

tmpfun2 <- function(B=1000) {
  x <- rnorm(25, 100, 5)                  # new "real" dataset; the null is true
  out <- replicate(B, tmpfun(x))          # simulated null distribution
  p1 <- mean(out >= tmpfun(x, sim=FALSE)) # simulation-based p-value
  p2 <- ks.test(x, 'pnorm', mean=mean(x), sd=sd(x))$p.value  # naive ks.test p-value
  return(c(p1=p1, p2=p2))
}

out <- replicate(1000, tmpfun2())

par(mfrow=c(2,1))
hist(out[1,])  # simulation-based p-values
hist(out[2,])  # naive ks.test p-values

For my simulation, the histogram of the simulation-based p-values is fairly uniform (which it should be, since the null is true), but the p-values from the ks.test function are bunched up much more towards 1.0.
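If you want numbers rather than a picture, a quick check (using the out matrix just created) is to compare how often each p-value falls below 0.05; since the null is true, both proportions should be close to 0.05:

mean(out[1,] < 0.05)  # simulation-based p-values: roughly nominal
mean(out[2,] < 0.05)  # naive ks.test p-values: typically far below 0.05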

You can change anything in the simulations, e.g. to estimate power by having the original data come from a different distribution, or to use a different null distribution, etc. The normal is probably the simplest case since the sample mean and variance are independent; more tuning may be needed for other distributions.
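For example, a rough power sketch (my own variation on the code above, with made-up settings: heavy-tailed t data tested against the normal null) could look like this:

tmpfun3 <- function(B=1000) {
  x <- 100 + 5*rt(25, df=3)          # "real" data are heavy-tailed, not normal
  out <- replicate(B, tmpfun(x))     # null distribution of the KS statistic
  mean(out >= tmpfun(x, sim=FALSE))  # simulation-based p-value
}

pw <- replicate(200, tmpfun3())
mean(pw < 0.05)                      # estimated power at the 5% level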