Solved – How to know if data follow a Poisson distribution in R

poisson-distribution, poisson-process, r, self-study

I am an undergrad student and I have a project for my probability class. Basically, I have a dataset about the hurricanes that impacted my country for a series of years.

In my probability book (Probability and Statistics with R) there is an incomplete example of how to check whether data follow a Poisson distribution. They begin by trying to show that these 3 criteria hold (criteria on page 120, example on pages 122-123 of my book):

1- The numbers of outcomes in non-overlapping intervals are independent. In other words, the number of outcomes in the time interval (0, t] is independent of the number of outcomes in the time interval (t, t+h], h > 0.

2- The probability of two or more outcomes in a sufficiently short interval is virtually zero. In other words, provided h is sufficiently small, the probability of obtaining two or more outcomes in the interval (t,t+h] is negligible compared to the probability of obtaining one or zero outcomes in the same interval of time.

3- The probability of exactly one outcome in a sufficiently short interval or small region is proportional to the length of the interval or region. In other words, the probability of one outcome in an interval of length h is lambda*h.

But criterion 3 is left "as an exercise".

A- Can someone tell me if there is an easier way to see whether my dataset follows a Poisson distribution?

B- Can someone explain criteria 1 and 3 to me with some type of example (if it is in R, fantastic)?

Thanks!

Note: Sorry for the long post. Also, I have to convert the data so that I have a table like:

  number of hurricanes       | 0 | 1 | 2  etc.
  -----------------------------------------
  total years that have      |   |   |
  that number of hurricanes  |   |   |

Best Answer

There are an infinite number of ways for a distribution to be slightly different from a Poisson distribution; you can't identify that a set of data is drawn from a Poisson distribution. What you can do is look for inconsistency with what you should see with a Poisson, but a lack of obvious inconsistency doesn't make it Poisson.

However, checking those three criteria isn't checking that the data come from a Poisson distribution by statistical means (i.e. by looking at data); it's assessing whether the process that generates the data satisfies the conditions of a Poisson process. If the conditions all held, or almost held (and that's a question about the data-generating process), you could have something from, or very close to, a Poisson process, which would in turn be a way of getting data drawn from something close to a Poisson distribution.
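Since part B of the question asks for an R illustration of criteria 1 and 3, here is a minimal sketch (my own addition; the rate, window length and interval width are arbitrary choices) that simulates an idealised Poisson process and checks both criteria empirically:

# Minimal sketch: simulate a homogeneous Poisson process and look at criteria 1 and 3.
# lambda, Tmax and h below are arbitrary choices for illustration.
set.seed(1)
lambda <- 2       # rate: events per unit time
Tmax   <- 1000    # length of the observation window

# event times via exponential inter-arrival times
arrivals <- cumsum(rexp(3 * lambda * Tmax, rate = lambda))
arrivals <- arrivals[arrivals <= Tmax]

# Criterion 1: counts in non-overlapping unit intervals should be (near) independent
counts <- as.numeric(table(cut(arrivals, breaks = 0:Tmax)))
cor(head(counts, -1), tail(counts, -1))   # correlation of adjacent intervals: close to 0

# Criterion 3 (and 2): in a short interval of length h,
# P(exactly one event) is roughly lambda*h, and P(two or more) is negligible
h <- 0.01
small <- as.numeric(table(cut(arrivals, breaks = seq(0, Tmax, by = h))))
mean(small == 1)   # compare with lambda * h = 0.02
mean(small >= 2)   # much smaller, of order (lambda * h)^2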

But the conditions don't hold in several ways... and the furthest from being true is number 3. There's no particular reason on that basis to assert a Poisson process, though the violations may not be so bad that the resulting data are far from Poisson.

So we're back to statistical arguments that come from examining the data itself.

As mentioned at the start, what you can do is check whether the data aren't obviously inconsistent with the underlying distribution being Poisson, but that doesn't tell you they are drawn from a Poisson (you can already be confident that they're not).

You could do this check via goodness of fit tests.

The chi-square test that was mentioned is one such, but I wouldn't recommend it for this situation myself**; it has low power against what many people would see as the most relevant deviations (e.g. a change in dispersion relative to a Poisson with the same mean, changes in skewness, etc.). If your aim is to have good power against those sorts of effects, you won't get it that way.

The main value of the chi-squared is in simplicity, and it has pedagogical value; outside that, it's usually not seen as particularly competitive as a goodness of fit test.
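As an aside (my own addition, not something discussed above), a simple check aimed squarely at the dispersion deviations just mentioned is the classical index-of-dispersion (variance-to-mean) test; a minimal sketch:

# Index-of-dispersion test: under a Poisson null,
# (n - 1) * var(y) / mean(y) is approximately chi-squared with n - 1 df.
dispersion_test <- function(y) {
  n <- length(y)
  D <- (n - 1) * var(y) / mean(y)
  p_over  <- pchisq(D, df = n - 1, lower.tail = FALSE)  # evidence of overdispersion
  p_under <- pchisq(D, df = n - 1)                      # evidence of underdispersion
  c(statistic = D, p.two.sided = 2 * min(p_over, p_under))
}

set.seed(1)
dispersion_test(rpois(100, 5))   # Poisson data: typically a non-small p-value

It only looks at dispersion, but that's often exactly the kind of deviation you care most about.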

** Added in later edit: Now that it's clear this is homework, the chances that you're expected to do a chi-squared test to check that the data are not inconsistent with a Poisson go up quite a lot. See my example chi-square goodness of fit test, done below the first Poissonness plot.


People often do these tests for the wrong reason (e.g. because they want to say 'therefore it's okay to do some other statistical thing with the data that assumes the data are Poisson'). The real question there is 'how badly wrong could that go?', and goodness of fit tests aren't really much help with that question. Often the answer to that question is at best one that's independent (or nearly independent) of sample size, and in some cases one with consequences that tend to go away with sample size ... while a goodness of fit test is useless with small samples (where your risk from violations of assumptions is often at its largest).

If you must test for a Poisson distribution, there are a few reasonable alternatives. One would be to do something akin to an Anderson-Darling test, based on the AD statistic but using a simulated distribution under the null (to account for the twin problems of a discrete distribution and the need to estimate parameters).
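A rough sketch of what that could look like (my own illustration; the particular discrete AD-type weighting and the parametric-bootstrap details are simplifying choices, not a standard canned routine):

# Parametric-bootstrap goodness of fit using a discrete AD-type statistic
ad_stat <- function(y) {
  lam <- mean(y)
  xs  <- 0:max(max(y), qpois(0.9999, lam))   # support used for the sum
  Fn  <- ecdf(y)(xs)                         # empirical CDF
  F0  <- ppois(xs, lam)                      # fitted Poisson CDF
  f0  <- dpois(xs, lam)
  w   <- F0 * (1 - F0)
  ok  <- w > 0
  length(y) * sum((Fn[ok] - F0[ok])^2 * f0[ok] / w[ok])
}

ad_test <- function(y, nsim = 2000) {
  stat0 <- ad_stat(y)
  sims  <- replicate(nsim, ad_stat(rpois(length(y), mean(y))))  # simulated null
  mean(sims >= stat0)                                           # p-value
}

set.seed(1)
ysim <- rpois(100, 5)
ad_test(ysim)   # a large p-value is expected here, since ysim really is Poisson

Because the parameter is re-estimated inside each simulated sample, the bootstrap accounts for the estimation step as well as the discreteness.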

A simpler alternative might be a Smooth Test for goodness of fit - these are a collection of tests designed for individual distributions by modelling the data using a family of polynomials which are orthogonal with respect to the probability function in the null. Low order (i.e. interesting) alternatives are tested by testing whether the coefficients of the polynomials above the base one are different from zero, and these can usually deal with parameter estimation by omitting the lowest order terms from the test. There's such a test for the Poisson. I can dig up a reference if you need it.

You might also use the correlation (or, to be more like a Shapiro-Francia test, perhaps $n(1-r^2)$) in a Poissonness plot - e.g. a plot of $\log(x_k)+\log(k!)$ vs $k$ (see Hoaglin, 1980) - as a test statistic.

Here's an example of that calculation (and plot), done in R:

y=rpois(100,5)    # simulate a sample of size 100 from a Poisson with mean 5
n=length(y)
(x=table(y))      # frequency of each observed value
y
 0  1  2  3  4  5  6  7  8  9 10 
 1  2  7 15 19 25 14  7  5  1  4 

k=as.numeric(names(x))           # the distinct observed values of k
plot(k,c(log(x)+lfactorial(k)))  # Poissonness plot: should be close to linear

[Figure: Poissonness plot, showing points close to a straight line]

Here's the statistic that I suggested could be used for a goodness of fit test of a Poisson:

n*(1-cor(k,log(x)+lfactorial(k))^2)
[1] 1.0599

Of course, to compute the p-value, you'd also need to simulate the distribution of the test statistic under the null (and I haven't discussed how one might deal with zero-counts inside the range of values). This should yield a reasonably powerful test. There are numerous other alternative tests.
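For what it's worth, that simulation might be sketched like this (my own illustration; values of k that never occur simply drop out of table(y) here, which sidesteps rather than resolves the zero-count issue just mentioned):

# Simulate the null distribution of n*(1 - r^2) by parametric bootstrap
poissonness_stat <- function(y) {
  x <- table(y)
  k <- as.numeric(names(x))
  length(y) * (1 - cor(k, log(as.numeric(x)) + lfactorial(k))^2)
}

stat0 <- poissonness_stat(y)   # y is the sample generated above
sims  <- replicate(5000, poissonness_stat(rpois(length(y), mean(y))))
mean(sims >= stat0)            # simulated p-value

With the Poisson sample above the simulated p-value should typically be large; for a sample that isn't close to Poisson it should be small.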

Here's an example of doing a Poissonness plot on a sample of size 50 from a geometric distribution (p=.3):
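A plot along those lines can be produced like this (my own reconstruction; the seed is arbitrary, so the exact picture will differ in detail from the one described below):

set.seed(42)                  # arbitrary seed
yg <- rgeom(50, prob = 0.3)   # sample of size 50 from a geometric(p = 0.3)
xg <- table(yg)
kg <- as.numeric(names(xg))
plot(kg, log(as.numeric(xg)) + lfactorial(kg),
     xlab = "k", ylab = "log(count) + log(k!)")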

[Figure: Poissonness plot of the geometric variate; it slopes downward at the start, then has a kink and heads up in a close to linear fashion]

As you see, it displays a clear 'kink', indicating nonlinearity.


References for the Poissonness plot would be:

David C. Hoaglin (1980),
"A Poissonness Plot",
The American Statistician
Vol. 34, No. 3 (Aug., 1980), pp. 146-149

and

Hoaglin, D. and J. Tukey (1985),
"9. Checking the Shape of Discrete Distributions",
Exploring Data Tables, Trends and Shapes,
(Hoaglin, Mosteller & Tukey eds)
John Wiley & Sons

The second reference contains an adjustment to the plot for small counts; you would probably want to incorporate it (but I don't have the reference to hand).


Example of doing a chi-square goodness of fit test:

An aside on performing the chi-square goodness of fit test, the way it would usually be expected to be done in a lot of classes (though not the way I'd do it):

1: Starting with your data (which I will take to be the data I randomly generated in y above), generate the table of counts:

(x=table(y))
y
 0  1  2  3  4  5  6  7  8  9 10 
 1  2  7 15 19 25 14  7  5  1  4 

2: Compute the expected count in each cell, assuming a Poisson fitted by ML:

 (expec=dpois(0:10,lambda=mean(y))*length(y))
 [1]  0.7907054  3.8270142  9.2613743 14.9416838 18.0794374 17.5008954 14.1173890  9.7611661
 [9]  5.9055055  3.1758496  1.5371112

3: Note that the end categories are small; this makes the chi-square distribution a poorer approximation to the distribution of the test statistic (a common rule is that you want expected values of at least 5, though numerous papers have shown that rule to be unnecessarily restrictive; I'll stay close to it here, but the general approach can be adapted to a stricter rule). Collapse adjacent categories so that the minimum expected values are at least not too far below 5 (one category with an expected count down near 1 out of more than 10 categories isn't too bad; two is pretty borderline). Also note that we haven't yet accounted for the probability beyond "10", so we need to incorporate that as well:

expec[1]=sum(expec[1:2])            # merge the "0" and "1" cells
expec[2:8]=expec[3:9]               # shift the remaining cells down
expec[9]=length(y)-sum(expec[1:8])  # final cell picks up everything from 9 upward
expec=expec[1:9]
expec
sum(expec) # now adds to n

4: Similarly, collapse categories in the observed counts:

(obs=table(y))
obs[1]=sum(obs[1:2])     # merge the "0" and "1" cells
obs[2:8]=obs[3:9]        # shift the remaining cells down
obs[9]=sum(obs[10:11])   # merge the "9" and "10" cells
obs=obs[1:9]

5: Put the observed and expected counts into a table, optionally along with the contribution to chi-square, $(O_i-E_i)^2/E_i$, and the Pearson residual (the signed square root of that contribution); these can be useful when trying to see where the fit is poor:

print(cbind(obs,expec,PearsonRes=(obs-expec)/sqrt(expec),ContribToChisq=(obs-expec)^2/expec),d=4)
  obs  expec PearsonRes ContribToChisq
0   3  4.618   -0.75282      0.5667335
1   7  9.261   -0.74308      0.5521657
2  15 14.942    0.01509      0.0002276
3  19 18.079    0.21650      0.0468729
4  25 17.501    1.79258      3.2133538
5  14 14.117   -0.03124      0.0009761
6   7  9.761   -0.88377      0.7810581
7   5  5.906   -0.37262      0.1388434
8   5  5.815   -0.33791      0.1141816

6: Compute $X^2 = \sum_i (E_i-O_i)^2/E_i$, with loss of 1df for expected total matching the observed total, and 1 more for estimating the parameter:

(chisq = sum((obs-expec)^2/expec))
[1] 5.414413
(df = length(obs)-1-1) # lose an additional df for parameter estimate
[1] 7
(pvalue=pchisq(chisq,df,lower.tail=FALSE))  # upper-tail probability
[1] 0.6095264

Both the diagnostics and the p-value show no lack of fit here ... which we'd expect, since the data we generated actually were Poisson.
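As a quick cross-check (not part of the recipe above), the built-in chisq.test reproduces the statistic from the collapsed tables, though it doesn't know a parameter was estimated, so its reported degrees of freedom and p-value need the same manual adjustment as in step 6:

# chisq.test gives the same X^2, but uses df = k - 1 = 8; since lambda was
# estimated from the data, recompute the p-value with df = 7 as in step 6.
# (It may also warn about the one small expected count.)
out <- chisq.test(obs, p = expec / sum(expec))
out$statistic                                           # same X^2 as above
pchisq(out$statistic, df = length(obs) - 2, lower.tail = FALSE)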


Edit: here's a link to Rick Wicklin's blog, which discusses the Poissonness plot and talks about implementations in SAS and MATLAB:

http://blogs.sas.com/content/iml/2012/04/12/the-poissonness-plot-a-goodness-of-fit-diagnostic/


Edit2: If I have it right, the modified Poissonness plot from the 1985 reference would be*:

y=rpois(100,5)
n=length(y)
(x=table(y))
k=as.numeric(names(x))
x=as.vector(x)
# adjusted counts: cells with a count of 1 are replaced by exp(-1),
# larger counts are shrunk slightly, and zero counts are left undefined
x1 = ifelse(x==0,NA,ifelse(x>1,x-.8*x/n-.67,exp(-1)))
plot(k,log(x1)+lfactorial(k))

* They actually adjust the intercept as well, but I haven't done so here; it doesn't affect the appearance of the plot, but you have to take care if you implement anything else from the reference (such as the confidence intervals) if you do it at all differently from their approach.

(For the above example the appearance hardly changes from the first Poissonness plot.)
