Solved – How to use the chi-squared test to determine if data follow the Poisson distribution

chi-squared-testdistributionspoisson distributionr

The figure below (Figure 1 from p. 646 of this paper) compares observed values against expected values under the Poisson distribution. It then runs a chi-squared test to see if the observed values differ from the expected values under Poisson distribution.

enter image description here

Using R, how is it possible to generate expected values under Poisson distribution and compare observed values using a chi-squared test?

EDIT:

Here's my attempt at doing what they did in paper. I want to know if the observed distribution of variable differs from a Poisson distribution. I also want to know if what I have done below is the same procedure as what they did in paper. As the P-value is >0.05, I have concluded below that the distribution of variable follows a Poisson distribution – could someone confirm this?

df <- data.frame(variable = 0:5, frequency = c(20, 10, 5, 3, 2, 1))

# estimate lambda
mean_df_variable <- mean(df$variable)

# calculate expected values if df$frequency follows a poisson distribution
library(plyr)
expected <- laply(0:5, function(x) dpois(x=x, lambda=mean_df_variable, log = FALSE))

# calculate actual distribution of df$frequency
observed <- df$frequency/sum(df$frequency)

# does distribution of df$frequency differ from a poisson distribution? Apparently 
#   not because P-value is > 0.05
chisq.test(expected, observed)

Best Answer

The way you did the chi-squared test is not correct. There are several issues. First, your data frame looks like this:

  variable frequency
1        0        20
2        1        10
3        2         5
4        3         3
5        4         2
6        5         1

So when you run mean(df$variable), you get 2.5, which is just the mean of 0:5. That is, it is unweighted. Instead, create your variable like this:

x = rep(0:5, times=c(20, 10, 5, 3, 2, 1))
table(x)
# x
#  0  1  2  3  4  5 
# 20 10  5  3  2  1
mean(x)
# [1] 1.02439

The table() call shows that the code gives us what we wanted, and so mean() estimates lambda correctly.

Next, your estimated probabilities only go to 5, but the Poisson distribution goes to infinity. So you need to account for the probabilities of the values that you don't have in your dataset. This is not hard to do, you just calculate the complement:

probs = dpois(0:5, lambda=mean(x))
probs
# [1] 0.359015310 0.367771781 0.188370912 0.064321775 0.016472650 0.003374884
comp = 1-sum(probs)
# [1] 0.0006726867

Lastly, in R's chisq.test() function, the x= and y= arguments aren't exactly for the expected and observed values in the way you set this up. For one thing, what you are calling "expected" are actually probabilities (i.e., the output from dpois()), to make these expected values, you would have to multiply those probabilities (and be sure to include the compliment) by the total count. But even then, you wouldn't use those for y=. At any rate, you don't actually have to do that, you can just assign the probabilities to the p= argument. In addition, you will need to add a 0 to your observed values vector to represent all of the possible values that don't show up in your dataset:

chisq.test(x=c(20, 10, 5, 3, 2, 1, 0), p=c(probs, comp))

#  Chi-squared test for given probabilities
# 
# data:  c(20, 10, 5, 3, 2, 1, 0)
# X-squared = 12.6058, df = 6, p-value = 0.04974
# 
# Warning message:
#   In chisq.test(x = c(20, 10, 5, 3, 2, 1, 0), p = c(probs, comp)) :
#   Chi-squared approximation may be incorrect

The warning message suggests we may prefer to simulate instead, so we try again:

chisq.test(x=c(20, 10, 5, 3, 2, 1, 0), p=c(probs, comp), simulate.p.value=TRUE)

# Chi-squared test for given probabilities with simulated p-value 
#   (based on 2000 replicates)
# 
# data:  c(20, 10, 5, 3, 2, 1, 0)
# X-squared = 12.6058, df = NA, p-value = 0.07046

This is presumably a more accurate p-value, but it raises a question about how it should be interpreted. You ask "As the P-value is >0.05, I have concluded below that the distribution of variable follows a Poisson distribution - could someone confirm this?" Using the correct approach, we note that the first p-value was just <.05, but the second (simulated) p-value was just >.05. Although the latter p-value is more accurate, I would not rush to conclude that the data did come from a Poisson distribution. Here are some facts to bear in mind: