Solved – How to use the chi-squared test to determine if data follow the Poisson distribution

chi-squared-testdistributionspoisson distributionr

The figure below (Figure 1 from p. 646 of this paper) compares observed values against expected values under the Poisson distribution. It then runs a chi-squared test to see if the observed values differ from the expected values under Poisson distribution.

enter image description here

Using R, how is it possible to generate expected values under Poisson distribution and compare observed values using a chi-squared test?

EDIT:

Here's my attempt at doing what they did in paper. I want to know if the observed distribution of variable differs from a Poisson distribution. I also want to know if what I have done below is the same procedure as what they did in paper. As the P-value is >0.05, I have concluded below that the distribution of variable follows a Poisson distribution – could someone confirm this?

df <- data.frame(variable = 0:5, frequency = c(20, 10, 5, 3, 2, 1))

# estimate lambda
mean_df_variable <- mean(df$variable)

# calculate expected values if df$frequency follows a poisson distribution
library(plyr)
expected <- laply(0:5, function(x) dpois(x=x, lambda=mean_df_variable, log = FALSE))

# calculate actual distribution of df$frequency
observed <- df$frequency/sum(df$frequency)

# does distribution of df$frequency differ from a poisson distribution? Apparently 
#   not because P-value is > 0.05
chisq.test(expected, observed)

Best Answer

The way you did the chi-squared test is not correct. There are several issues. First, your data frame looks like this:

  variable frequency
1        0        20
2        1        10
3        2         5
4        3         3
5        4         2
6        5         1

So when you run mean(df$variable), you get 2.5, which is just the mean of 0:5. That is, it is unweighted. Instead, create your variable like this:

x = rep(0:5, times=c(20, 10, 5, 3, 2, 1))
table(x)
# x
#  0  1  2  3  4  5 
# 20 10  5  3  2  1
mean(x)
# [1] 1.02439

The table() call shows that the code gives us what we wanted, and so mean() estimates lambda correctly.

Next, your estimated probabilities only go to 5, but the Poisson distribution goes to infinity. So you need to account for the probabilities of the values that you don't have in your dataset. This is not hard to do, you just calculate the complement:

probs = dpois(0:5, lambda=mean(x))
probs
# [1] 0.359015310 0.367771781 0.188370912 0.064321775 0.016472650 0.003374884
comp = 1-sum(probs)
# [1] 0.0006726867

Lastly, in R's chisq.test() function, the x= and y= arguments aren't exactly for the expected and observed values in the way you set this up. For one thing, what you are calling "expected" are actually probabilities (i.e., the output from dpois()), to make these expected values, you would have to multiply those probabilities (and be sure to include the compliment) by the total count. But even then, you wouldn't use those for y=. At any rate, you don't actually have to do that, you can just assign the probabilities to the p= argument. In addition, you will need to add a 0 to your observed values vector to represent all of the possible values that don't show up in your dataset:

chisq.test(x=c(20, 10, 5, 3, 2, 1, 0), p=c(probs, comp))

#  Chi-squared test for given probabilities
# 
# data:  c(20, 10, 5, 3, 2, 1, 0)
# X-squared = 12.6058, df = 6, p-value = 0.04974
# 
# Warning message:
#   In chisq.test(x = c(20, 10, 5, 3, 2, 1, 0), p = c(probs, comp)) :
#   Chi-squared approximation may be incorrect

The warning message suggests we may prefer to simulate instead, so we try again:

chisq.test(x=c(20, 10, 5, 3, 2, 1, 0), p=c(probs, comp), simulate.p.value=TRUE)

# Chi-squared test for given probabilities with simulated p-value 
#   (based on 2000 replicates)
# 
# data:  c(20, 10, 5, 3, 2, 1, 0)
# X-squared = 12.6058, df = NA, p-value = 0.07046

This is presumably a more accurate p-value, but it raises a question about how it should be interpreted. You ask "As the P-value is >0.05, I have concluded below that the distribution of variable follows a Poisson distribution - could someone confirm this?" Using the correct approach, we note that the first p-value was just <.05, but the second (simulated) p-value was just >.05. Although the latter p-value is more accurate, I would not rush to conclude that the data did come from a Poisson distribution. Here are some facts to bear in mind:

As stated in the title of a paper by Gelman and Stern, The difference between "significant" and "non-significant" is not itself statistically significant.
Real data never come from any of the idealized distributions, a fact pointed out well by @Glen_b recently here: What distribution does this histogram look like?

Real data tends not to follow the simple distributional shapes of the common one-, two- or three- parameter distributions. Real distributions are more like heterogeneous mixtures. Simple distributional forms are convenient fictions (models, to be precise) - they approximate reality in ways that make it easier to work with.
You cannot use the fact of a non-significant result to affirm the null hypothesis, as I explain here: Why do statisticians say a non-significant result means "you can't reject the null" as opposed to accepting the null hypothesis?

Best Answer

Related Solutions

Solved – Proper use and interpretation of chi-squared test

Solved – How does chi-squared test of homogeneity differ from other chi-squared tests

Related Question