Solved – Goodness of fit to Poisson Distribution

Tags: goodness of fit, poisson distribution, probability

What are some of the well-known statistical tests for measuring the goodness of fit of observed random variables to a Poisson distribution? I know the Kolmogorov-Smirnov test is one; are there any others?

Best Answer

1) There are two issues with the Kolmogorov-Smirnov test*:

a) It assumes the distribution is completely specified, with no estimated parameters. If you estimate parameters, the KS test becomes a form of Lilliefors test (in this case for Poisson-ness), and you need different critical values.

b) It assumes the distribution is continuous.

Both affect the calculation of p-values, and both make the test less likely to reject.

*(The same applies to the Cramér-von Mises and Anderson-Darling tests, and to any other test that assumes a continuous, completely specified null.)

Unless you don't mind a potentially highly-conservative test (of unknown size), you have to adjust the calculation of the significance for both of these; simulation would be called for.
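A minimal sketch of what that simulation could look like in R, assuming the Poisson mean is estimated by the sample mean. The function names `ks_pois_stat` and `ks_pois_mc` are made up for illustration, and the simulated samples are only there to exercise the code. Because both the ECDF and the Poisson cdf are step functions that jump only at the integers, the largest gap can be found by checking the integers from 0 to the sample maximum.

```r
# KS-type distance between the ECDF and a Poisson cdf with estimated mean.
# (Illustrative helper, not a library function.)
ks_pois_stat <- function(x) {
  lambda_hat <- mean(x)
  grid <- 0:max(x)   # both step functions only change at integers
  max(abs(ecdf(x)(grid) - ppois(grid, lambda_hat)))
}

# Monte Carlo p-value: re-estimate lambda in every simulated sample so the
# null distribution accounts for both discreteness and parameter estimation.
ks_pois_mc <- function(x, n_sim = 2000) {
  d_obs <- ks_pois_stat(x)
  lambda_hat <- mean(x)
  d_sim <- replicate(n_sim, ks_pois_stat(rpois(length(x), lambda_hat)))
  list(statistic = d_obs,
       p.value = (1 + sum(d_sim >= d_obs)) / (n_sim + 1))
}

set.seed(1)
x_pois <- rpois(200, lambda = 3)            # data that really are Poisson
x_over <- rnbinom(200, size = 1, mu = 3)    # strongly overdispersed counts

ks_pois_mc(x_pois)$p.value
ks_pois_mc(x_over)$p.value
```

The simulated samples are drawn from the fitted Poisson rather than the true one, so the resulting p-value is an approximation, though usually a much better one than the tabulated KS value.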

2) On the other hand, a vanilla chi-square goodness of fit test is a terrible idea when testing something that's ordered, as a Poisson is. By ignoring the ordering, it throws away power against the alternatives you would actually care about, such as overdispersion, and instead spends it on things like 'an excess of even numbers over odd numbers'. As a result, its power against interesting alternatives is generally even lower than that of the vanilla KS, but without the compensation of the much lower type I error rate.

I think this is even worse.

3) On the gripping hand, you can partition the chi-squared statistic into components that do respect the ordering via orthogonal polynomials, and drop the less interesting highest-order components. In this particular case you'd use polynomials orthogonal with respect to the Poisson probability function (p.f.).

This is the approach taken in Rayner and Best's little 1989 book, Smooth Tests of Goodness of Fit (they have a newer one on smooth tests in R that might make your life easier).

Alternatively, see papers like this one:

http://www.jstor.org/discover/10.2307/1403470
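To make the orthogonal-polynomial idea concrete, here is a small R sketch of the first two polynomials that are orthonormal with respect to the Poisson p.f., together with a numerical check. The names `h1`, `h2` and `smooth_v2` are my own, and the scaling is one natural choice that may not match the book's notation exactly. When the mean is estimated by the sample mean, the order-1 component is identically zero, so the order-2 component is the first informative one; algebraically it equals (D - n)/sqrt(2n), where D is the usual index of dispersion, so it is aimed at over- and underdispersion.

```r
# First two polynomials orthonormal under Poisson(lambda) (illustrative names)
h1 <- function(x, lambda) (x - lambda) / sqrt(lambda)
h2 <- function(x, lambda) ((x - lambda)^2 - x) / (lambda * sqrt(2))

# Numerical check: means 0, variances 1, cross-product 0
lambda <- 2.5
k <- 0:200                       # range carrying essentially all the mass
p <- dpois(k, lambda)
round(c(mean1 = sum(h1(k, lambda) * p),
        mean2 = sum(h2(k, lambda) * p),
        var1  = sum(h1(k, lambda)^2 * p),
        var2  = sum(h2(k, lambda)^2 * p),
        cross = sum(h1(k, lambda) * h2(k, lambda) * p)), 8)

# Order-2 component with lambda replaced by the sample mean
smooth_v2 <- function(x) {
  lambda_hat <- mean(x)
  sum(h2(x, lambda_hat)) / sqrt(length(x))
}

set.seed(2)
x <- rpois(50, 4)
D <- sum((x - mean(x))^2) / mean(x)          # index of dispersion
c(v2 = smooth_v2(x), from_dispersion = (D - length(x)) / sqrt(2 * length(x)))
```

In large samples the component is roughly standard normal under the null; in small samples its null distribution is better obtained by simulation.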

4) However, depending on why you're doing it, it may be better to reconsider the whole enterprise...

The discussion in questions like these carries over to most goodness of fit tests, and indeed often to tests of assumptions in general:

Is normality testing 'essentially useless'?

What tests do I use to confirm that residuals are normally distributed?

Related Solutions

Solved – How to test whether a sample of data fits the family of Gamma distribution

I think the question asks for a precise statistical test, not for a histogram comparison. When using the Kolmogorov-Smirnov test with estimated parameters, the distribution of the test statistic under the null depends on the tested distribution, as opposed to the case with no estimated parameter. For instance, using (in R)

x <- rnorm(100)
ks.test(x, "pnorm", mean=mean(x), sd=sd(x))

leads to

        One-sample Kolmogorov-Smirnov test

data:  x 
D = 0.0701, p-value = 0.7096
alternative hypothesis: two-sided

while we get

> ks.test(x, "pnorm")

        One-sample Kolmogorov-Smirnov test

data:  x 
D = 0.1294, p-value = 0.07022
alternative hypothesis: two-sided

for the same sample x. The significance level or p-value thus has to be determined by Monte Carlo simulation under the null, producing the distribution of the Kolmogorov-Smirnov statistic from samples simulated under the estimated distribution (with a slight approximation in the result, given that the observed sample comes from another distribution, even under the null).
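A hedged sketch of that Monte Carlo calculation for the normal example, with the helper name `ks_norm_stat` invented for illustration. The normal family is a convenient special case: the statistic computed on standardized data does not depend on the true mean and standard deviation, so simulating from a standard normal is enough. For a family with a shape parameter, such as the Gamma, you would simulate from the fitted distribution and re-estimate the parameters in every simulated sample.

```r
# KS distance after plugging in the sample mean and sd (illustrative helper)
ks_norm_stat <- function(x) {
  z <- (x - mean(x)) / sd(x)
  unname(ks.test(z, "pnorm")$statistic)
}

set.seed(42)
x <- rnorm(100)
d_obs <- ks_norm_stat(x)

# Null distribution of the statistic when parameters are re-estimated each time
d_sim <- replicate(2000, ks_norm_stat(rnorm(length(x))))
p_mc <- (1 + sum(d_sim >= d_obs)) / (length(d_sim) + 1)

# Naive p-value that ignores the estimation step, for comparison
p_naive <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$p.value

c(D = d_obs, p_naive = p_naive, p_monte_carlo = p_mc)
```

The Monte Carlo p-value should generally come out smaller than the naive one, which is the conservativeness the example output illustrates.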

Solved – Goodness-of-Fit for continuous variables

What are some goodness of fit tests or indices for the continuous case?

Most goodness of fit tests are for the continuous case. There are, quite literally, hundreds of them. Besides the Kolmogorov-Smirnov test (for a fully specified distribution, based on maximum difference in ECDF) some commonly used ones include the Anderson-Darling test (also fully specified and ECDF based; a variance-weighted version of the Cramer-von Mises test) and the Shapiro-Wilk (parameters unspecified, for testing normality only).

For example I am looking at Kolmogorov-Smirnov test.

Okay, but why? That is, why are you testing goodness of fit?

What I don't get is how one gets the empirical CDF in the first place?

It's simply the sample version of the cdf. The cdf is $P(X\leq x)$, the ECDF is the same thing, with 'probability' (for the random variable) replaced with 'proportion' (of the data). That is, you compute the proportion of the data that is less than or equal to every value $x$ in the range (ECDFs only change at data values, but are still defined between them - you really only need to identify their value at each data point and to the left of the entire sample, since they're constant from each data point until the next data point)

Take a small set of numbers and try it.

Here we go, a sample of three data values:

13.2  15.8  17.5

now, for the following $x$ values, what is the proportion of the data $\leq x$?

x = 10, 13.2-$\varepsilon$, 13.2, 13.2+$\varepsilon$, 15, 15.8, 17.5-$\varepsilon$, 19

(where $\varepsilon$ is some very small number)

Can you see how it works?

(Hint: the first five answers are 0, 0, 1/3, 1/3, 1/3 and the last one is 1; the full ECDF is plotted at the end of my answer)
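If you want to check your answers, here is a short R sketch that evaluates the ECDF of the three values at those points, once with the built-in `ecdf()` and once directly from the definition (the proportion of data less than or equal to x). The value chosen for `eps` is arbitrary.

```r
x <- c(13.2, 15.8, 17.5)
Fn <- ecdf(x)                    # base R empirical cdf (a step function)

eps <- 1e-6                      # stands in for "some very small number"
q <- c(10, 13.2 - eps, 13.2, 13.2 + eps, 15, 15.8, 17.5 - eps, 19)

# The two columns agree: ecdf() is just the proportion of data <= q
data.frame(x = q,
           ecdf = Fn(q),
           by_hand = sapply(q, function(v) mean(x <= v)))
```

The values step up by 1/3 exactly at each data point and stay flat in between.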

What I mean is, let's say I do a regression analysis with gaussian errors.

What prompts you to use this example? Did something (a book, say, or a website) lead you to think you ought to use a goodness of fit test in this situation?

I have the maximum likelihood estimate of the parameters. Now I also need to do a density estimation for the empirical CDF?

Empirical cdf of what?

Note that the KS is a test, not an estimate. What hypothesis are you testing and why?

Aren't they the same thing?

No, they're quite different, as discussed below.

Isn't my likelihood already giving me a goodness of fit?

The likelihood for the regression tells you about fit of the line; in the case below, how close the red line is to the data.

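[Figure not reproduced: scatterplot of the data with the fitted regression line in red and a green strip marking the values near one particular x]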

You could replace the data with another set of values with the same summary statistics but a different distribution, and the likelihood would be identical.

See the Anscombe quartet for a good example of how very different data could have the same likelihood surface.
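The quartet ships with R as the `anscombe` data frame, so this is easy to check. A minimal sketch: fit the same straight-line model to each of the four data sets and compare the coefficients, residual standard error and maximized log-likelihood. The published values are rounded, so the fits agree to about two decimal places rather than exactly.

```r
# Fit y_i ~ x_i for each of the four Anscombe data sets
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# One row per data set: intercept, slope, residual sd, log-likelihood
summary_table <- t(sapply(fits, function(f) {
  c(intercept = unname(coef(f)[1]),
    slope     = unname(coef(f)[2]),
    sigma     = summary(f)$sigma,
    logLik    = as.numeric(logLik(f)))
}))
round(summary_table, 3)
```

Plotting the four data sets makes the point: the rows are nearly indistinguishable, yet only one of them looks like a line plus well-behaved noise.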

By contrast, with a goodness of fit test you're checking whether some distribution, like a normal distribution with some mean and variance, fits the data (the KS test measures the discrepancy from the hypothesized distribution by looking at the ECDF, giving a test that doesn't change when you apply the same monotonic transformation to both halves of the comparison, which is what makes it nonparametric):

[Figure not reproduced: the ECDF of the data compared with the hypothesized distribution]

So how does this relate to linear regression?

Some people try to test whether the assumption of normality around the line holds (such as the distribution in the green strip in the first plot), as a check on the assumption about the error distribution:

[Figure not reproduced: the distribution of the values around the line, as in the green strip]

but this check is done across all x, not just at some particular x (I showed values near a particular $x$ to emphasize that it's the conditional distribution of $y$, or equivalently the distribution of the errors, that is relevant).

-- it's not clear from your description if that's what you mean to ask about, though.

However:

1) Formally testing goodness of fit as a check on assumptions isn't necessarily suitable:

(i) it answers the wrong question (the relevant question is 'what is the impact on my inference of the degree of non-normality we have?'), and

(ii) only tells you anything when it's of almost no use to you to know it (goodness of fit tests tend to show significance in medium to large samples, where it usually doesn't matter much, and tend not to be significant in small samples where it matters most), and

(iii) changing what you do based on the outcome is usually less appropriate than simply assuming you'd reject the null in the first place (once the choice of procedure depends on a test, your regression inference no longer has the desired properties).
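Point (ii) is easy to see by simulation. The sketch below is only an illustration under assumptions I've picked myself: mildly heavy-tailed data (a t distribution with 5 degrees of freedom), the Shapiro-Wilk test at the 5% level, and three arbitrary sample sizes. The helper name `reject_rate` is made up.

```r
set.seed(123)

# Proportion of simulated samples of size n in which Shapiro-Wilk rejects
# normality at the 5% level; the data are t with 5 df, so never exactly normal.
reject_rate <- function(n, n_rep = 500) {
  mean(replicate(n_rep, shapiro.test(rt(n, df = 5))$p.value < 0.05))
}

rates <- sapply(c(n20 = 20, n200 = 200, n2000 = 2000), reject_rate)
rates
```

The same degree of non-normality is rarely flagged at n = 20 and almost always flagged at n = 2000, which is the wrong way round relative to where it matters for inference.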

2) even without all that, the KS is a test for a fully specified distribution. You have to specify the mean and standard deviation for each data point before you see any data. If you're estimating the mean (say by fitting a line) and a standard deviation (say by the standard error of the residuals, s), then you simply shouldn't be using the KS test.

There are tests for the situation where you estimate the mean and variance (the equivalent to the KS test is called the Lilliefors test), but for normality the standard is the Shapiro-Wilk test (the simpler Shapiro-Francia test is almost as powerful, but most stats software implements the full Shapiro-Wilk test).

Why do I need KS?

Well, basically you don't.

There is almost never a circumstance when that's a good choice for the situation you describe.

My suggestion is either to use some procedure that doesn't assume normality (e.g. some robust approach, or perhaps least squares with inference based on resampling) or, if you're in a position to reasonably assume normality, to double-check the reasonableness of the assumption with a diagnostic display (like a Q-Q plot; incidentally, the Shapiro-Francia test is effectively based on the $R^2$ in that plot).
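A rough sketch of that diagnostic in R, using made-up regression data. It draws the normal Q-Q plot of the residuals and computes the squared correlation between the plotted coordinates, which approximates the Shapiro-Francia statistic (the exact statistic uses expected normal order statistics, whereas `qqnorm()` uses a standard approximation to them). The Shapiro-Wilk statistic is shown alongside for comparison.

```r
set.seed(7)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30)     # simulated data, purely for illustration
res <- residuals(lm(y ~ x))

qq <- qqnorm(res)                # normal Q-Q plot; returns the plotted points
qqline(res)                      # reference line through the quartiles

# Squared correlation in the Q-Q plot, close to the Shapiro-Francia statistic
w_approx <- cor(qq$x, qq$y)^2
c(qq_r_squared = w_approx,
  shapiro_wilk = unname(shapiro.test(res)$statistic))
```

Values close to 1 correspond to points lying close to a straight line in the plot; the plot itself shows where and how any departure occurs, which the single number cannot.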

In large samples, normality is less important to your inference (for everything but prediction intervals), so you can tolerate larger deviations from normality (equal variance and independence assumptions matter much more).

In small samples, you're more dependent on the assumption for your testing and confidence intervals, but you simply can't be sure how bad the degree of non-normality you have is. You're better with small samples to simply work as if your data were non-normal. (There are a number of good robust options, but you should usually also consider the potential impact of influential points, not just of potential y-outliers.)

ECDF for the small example data set earlier in the answer:

[Figure not reproduced: ECDF for 13.2, 15.8, 17.5]
