1) There are two issues with the Kolmogorov-Smirnov test* -
a) it assumes the distribution is completely specified, with no estimated parameters. If you estimate parameters, the KS becomes a form of Lilliefors test (in this case, for Poisson-ness), and you need different critical values
b) it assumes the distribution is continuous
Both of these impact the calculation of p-values, and both make the test less likely to reject.
*(and the Cramer-von Mises and the Anderson Darling, and any other test that assumes a continuous, completely specified null)
Unless you don't mind a potentially highly-conservative test (of unknown size), you have to adjust the calculation of the significance for both of these; simulation would be called for.
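As a sketch of what that simulation might look like (this is my own illustration, not code from the question; the helper names are made up), a parametric bootstrap re-estimates the parameter in every simulated sample, so the null distribution of the KS statistic accounts for both the estimation step and the discreteness:

```python
# Illustrative parametric-bootstrap KS test for Poisson-ness.
# Tabulated KS critical values don't apply here (estimated lambda,
# discrete distribution), so we simulate the null distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ks_stat_poisson(x, lam):
    # KS distance between the ECDF and the fitted Poisson CDF,
    # evaluated over the integer support of the sample.
    xs = np.arange(0, x.max() + 1)
    ecdf = np.searchsorted(np.sort(x), xs, side="right") / len(x)
    return np.max(np.abs(ecdf - stats.poisson.cdf(xs, lam)))

def bootstrap_ks_pvalue(x, n_sim=2000):
    lam_hat = x.mean()                    # estimate the parameter
    d_obs = ks_stat_poisson(x, lam_hat)
    d_sim = np.empty(n_sim)
    for b in range(n_sim):
        xb = rng.poisson(lam_hat, size=len(x))     # simulate under the fitted null
        d_sim[b] = ks_stat_poisson(xb, xb.mean())  # re-estimate each time
    return d_obs, (d_sim >= d_obs).mean()

x = rng.poisson(3.0, size=100)
d, p = bootstrap_ks_pvalue(x)
```

The key point is re-estimating lambda inside the loop; plugging the original estimate into every simulated sample would reproduce the conservatism the test is meant to avoid.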
2) on the other hand, a vanilla chi-square goodness of fit test is a terrible idea when testing something that's ordered, as a Poisson is. By ignoring the ordering, it's really not very sensitive to the more interesting alternatives: it throws away power against directly interesting alternatives like overdispersion, instead spending its power against things like 'an excess of even numbers over odd numbers'. As a result, its power against interesting alternatives is generally even lower than that of the vanilla KS, but without the compensation of the KS's much lower type I error rate.
I think this is even worse.
3) on the gripping hand, you can partition the chi-squared into components that do respect the ordering via the use of orthogonal polynomials, and drop off the less interesting highest-order components. In this particular case you'd use polynomials orthogonal to the Poisson p.f.
This is an approach taken in Rayner and Best's little 1989 book on Smooth Tests of Goodness of Fit (they have a newer one on smooth tests in R that might make your life easier)
Alternatively, see papers like this one:
http://www.jstor.org/discover/10.2307/1403470
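As a lighter-weight cousin of the component approach, the classical index-of-dispersion test targets exactly the overdispersion alternative mentioned above, and is closely related to the quadratic component of the smooth-test partition for the Poisson. A minimal sketch (my own illustration, not taken from either reference):

```python
# Index-of-dispersion test: a directed test for over/under-dispersion
# relative to the Poisson, rather than a test against all departures.
import numpy as np
from scipy import stats

def dispersion_test(x):
    n = len(x)
    # (n-1) * sample variance / sample mean is approximately
    # chi-square with n-1 df under the Poisson null
    d = (n - 1) * x.var(ddof=1) / x.mean()
    # two-sided p-value: small d suggests underdispersion,
    # large d suggests overdispersion
    p = 2 * min(stats.chi2.cdf(d, n - 1), stats.chi2.sf(d, n - 1))
    return d, p

rng = np.random.default_rng(1)
x = rng.poisson(4.0, size=200)
d, p = dispersion_test(x)
```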
4) However, depending on why you're doing it, it may be better to reconsider the whole enterprise...
The discussion in questions like these carries over to most goodness of fit tests ... and indeed often to most tests of assumptions in general:
Is normality testing 'essentially useless'?
What tests do I use to confirm that residuals are normally distributed?
The paper by Clauset et al. warns (Section 4.2) against small sample sizes (< 100), which are much easier to fit. You may want to consider direct comparisons of models instead.
While the p-value of the KS statistic with estimated parameters is an overestimate, the bootstrapping procedure you described addresses this and provides a correct p-value, given enough simulations.
However, the way the goodness of fit is computed in your code is not correct, as it does not strictly follow the procedure described in the paper and implemented in the poweRlaw package.
Specifically: the synthetic data generation procedure is only half implemented, since it does not search for the best xmin as the estimate_xmin function of the poweRlaw package does; and ks.test discards all ties, which the package's built-in KS test does not.
Code that takes these issues into account using poweRlaw is provided on this page; as a consequence, it is significantly slower than the code you suggested: http://notesnico.blogspot.com/2014/07/goodness-of-fit-test-for-log-normal-and.html
Most goodness of fit tests are for the continuous case. There are, quite literally, hundreds of them. Besides the Kolmogorov-Smirnov test (for a fully specified distribution, based on the maximum difference in ECDF), some commonly used ones include the Anderson-Darling test (also for a fully specified distribution and ECDF-based; a variance-weighted version of the Cramer-von Mises test) and the Shapiro-Wilk (parameters unspecified, for testing normality only).
Okay, but why? That is, why are you testing goodness of fit?
It's simply the sample version of the cdf. The cdf is $P(X\leq x)$; the ECDF is the same thing, with 'probability' (for the random variable) replaced by 'proportion' (of the data). That is, you compute the proportion of the data that is less than or equal to each value $x$ in the range. (ECDFs only change at data values, but are still defined between them; you really only need to identify their value at each data point and to the left of the entire sample, since they're constant from each data point until the next.)
Take a small set of numbers and try it.
Here we go, a sample of three data values:
now, for the following $x$ values, what is the proportion of the data $\leq x$?
(where $\varepsilon$ is some very small number)
Can you see how it works?
(Hint: the first five answers are 0, 0, 1/3, 1/3, 1/3 and the last one is 1; the full ECDF is plotted at the end of my answer)
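A quick way to check such answers is to compute the ECDF directly. Here's a small sketch with a made-up three-point sample (these are my own illustrative values, not the sample used in the exercise above):

```python
# ECDF of a tiny hypothetical sample: the proportion of data <= x.
import numpy as np

data = np.array([2.0, 5.0, 7.0])   # made-up three-point sample

def ecdf(data, x):
    # proportion of observations less than or equal to x
    return np.mean(np.sort(data) <= x)

# The ECDF is a step function: 0 below the smallest value, jumping by 1/3
# at each data point, and 1 at and above the largest value.
values = [ecdf(data, x) for x in (1.9, 2.0, 4.9, 5.0, 7.0, 100.0)]
# values -> [0.0, 1/3, 1/3, 2/3, 1.0, 1.0]
```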
What prompts you to use this example? Did something (a book, say, or a website) lead you to think you ought to use a goodness of fit test in this situation?
Empirical cdf of what?
Note that the KS is a test, not an estimate. What hypothesis are you testing and why?
No, they're quite different, as discussed below.
The likelihood for the regression tells you about fit of the line; in the case below, how close the red line is to the data.
You could replace the data with another set of values with the same summary statistics but a different distribution, and the likelihood would be identical.
See the Anscombe quartet for a good example of how very different data could have the same likelihood surface.
By contrast, with a goodness of fit test you're checking whether the shape of some distribution (say a normal distribution with some mean and variance) fits the data. The KS measures the discrepancy from the hypothesized distribution by looking at the ECDF, giving a test that is unchanged when you monotonically transform both the data and the hypothesized distribution, which is what makes it nonparametric:
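For concreteness, here is a sketch of the fully specified case, alongside the common misuse of plugging in estimates (the parameter values and seed are arbitrary choices of mine):

```python
# KS test against a *fully specified* null: mean and sd fixed in advance,
# which is the situation the tabulated KS distribution assumes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=2.0, size=50)

# Correct use: parameters chosen before seeing the data.
d, p = stats.kstest(x, "norm", args=(10.0, 2.0))

# Common misuse: plugging in estimates makes the test conservative,
# so the reported p-value is too large.
d_bad, p_bad = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
```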
So how does this relate to linear regression?
Some people try to test whether the assumption of normality around the line holds (such as the distribution in the green strip in the first plot), as a check on the assumption about the error distribution:
-- it's not clear from your description if that's what you mean to ask about, though.
However:
1) formally testing goodness of fit as a check on assumptions isn't necessarily suitable;
(i) it answers the wrong question (the relevant question is 'what is the impact on my inference of the degree of non-normality we have?'), and
(ii) only tells you anything when it's of almost no use to you to know it (goodness of fit tests tend to show significance in medium to large samples, where it usually doesn't matter much, and tend not to be significant in small samples where it matters most), and
(iii) changing what you do based on the outcome of the test is usually worse than simply acting as if you'd reject the null in the first place, since the resulting two-stage procedure means your regression inference no longer has the properties it's supposed to have.
2) even without all that, the KS is a test for a fully specified distribution. You have to specify the mean and standard deviation for each data point before you see any data. If you're estimating the mean (say by fitting a line) and a standard deviation (say by the standard error of the residuals, s), then you simply shouldn't be using the KS test.
There are tests for the situation where you estimate the mean and variance (the equivalent of the KS test is called the Lilliefors test), but for normality the standard is the Shapiro-Wilk test (the simpler Shapiro-Francia test is almost as powerful, though most stats software implements the full Shapiro-Wilk).
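As an illustration (not part of the original answer), the Shapiro-Wilk test is available directly in common stats libraries, e.g. in scipy; the residuals below are simulated stand-ins:

```python
# Shapiro-Wilk test of normality on simulated "residuals".
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
resid = rng.normal(size=40)    # stand-in for regression residuals

# Unlike the KS test, Shapiro-Wilk is designed for the case where
# the mean and variance are not specified in advance.
w, p = stats.shapiro(resid)
```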
Well, basically you don't.
There is almost never a circumstance when that's a good choice for the situation you describe.
My suggestion is to either use some procedure that doesn't assume normality (e.g. some robust approach, or perhaps least squares but with inference based on resampling), or, if you're in a position to reasonably assume normality, to double-check the reasonableness of the assumption with a diagnostic display (like a Q-Q plot; incidentally, the Shapiro-Francia test is effectively based on the $R^2$ in that plot).
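To make that parenthetical remark concrete, here is a sketch of the Q-Q-plot correlation idea: the squared correlation between the ordered data and normal quantiles is essentially what the Shapiro-Francia test is based on (the plotting positions below are one conventional choice, an assumption on my part):

```python
# Q-Q plot correlation: a diagnostic quantity rather than a formal test.
import numpy as np
from scipy import stats

def qq_r_squared(x):
    n = len(x)
    # theoretical normal quantiles at Blom-type plotting positions
    q = stats.norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))
    r = np.corrcoef(np.sort(x), q)[0, 1]
    return r ** 2

rng = np.random.default_rng(4)
r2_normal = qq_r_squared(rng.normal(size=100))       # close to 1 for normal data
r2_skewed = qq_r_squared(rng.exponential(size=100))  # noticeably lower
```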
In large samples, normality is less important to your inference (for everything but prediction intervals), so you can tolerate larger deviations from normality (equal variance and independence assumptions matter much more).
In small samples, you're more dependent on the assumption for your testing and confidence intervals, but you simply can't be sure how bad the degree of non-normality you have is. You're better with small samples to simply work as if your data were non-normal. (There are a number of good robust options, but you should usually also consider the potential impact of influential points, not just of potential y-outliers.)
ECDF for the small example data set earlier in the answer: