The figure below (Figure 1 from p. 646 of this paper) compares observed values against expected values under the Poisson distribution. It then runs a chi-squared test to see if the observed values differ from the expected values under Poisson distribution.
Using R, how is it possible to generate expected values under Poisson distribution and compare observed values using a chi-squared test?
EDIT:
Here's my attempt at doing what they did in paper. I want to know if the observed distribution of variable
differs from a Poisson distribution. I also want to know if what I have done below is the same procedure as what they did in paper. As the P-value is >0.05, I have concluded below that the distribution of variable
follows a Poisson distribution – could someone confirm this?
df <- data.frame(variable = 0:5, frequency = c(20, 10, 5, 3, 2, 1))
# estimate lambda
mean_df_variable <- mean(df$variable)
# calculate expected values if df$frequency follows a poisson distribution
library(plyr)
expected <- laply(0:5, function(x) dpois(x=x, lambda=mean_df_variable, log = FALSE))
# calculate actual distribution of df$frequency
observed <- df$frequency/sum(df$frequency)
# does distribution of df$frequency differ from a poisson distribution? Apparently
# not because P-value is > 0.05
chisq.test(expected, observed)
Best Answer
The way you did the chi-squared test is not correct. There are several issues. First, your data frame looks like this:
So when you run
mean(df$variable)
, you get2.5
, which is just the mean of0:5
. That is, it is unweighted. Instead, create your variable like this:The
table()
call shows that the code gives us what we wanted, and somean()
estimates lambda correctly.Next, your estimated probabilities only go to
5
, but the Poisson distribution goes to infinity. So you need to account for the probabilities of the values that you don't have in your dataset. This is not hard to do, you just calculate the complement:Lastly, in
R
'schisq.test()
function, thex=
andy=
arguments aren't exactly for the expected and observed values in the way you set this up. For one thing, what you are calling "expected" are actually probabilities (i.e., the output fromdpois()
), to make these expected values, you would have to multiply those probabilities (and be sure to include the compliment) by the total count. But even then, you wouldn't use those fory=
. At any rate, you don't actually have to do that, you can just assign the probabilities to thep=
argument. In addition, you will need to add a0
to your observed values vector to represent all of the possible values that don't show up in your dataset:The warning message suggests we may prefer to simulate instead, so we try again:
This is presumably a more accurate p-value, but it raises a question about how it should be interpreted. You ask "As the P-value is >0.05, I have concluded below that the distribution of variable follows a Poisson distribution - could someone confirm this?" Using the correct approach, we note that the first p-value was just <.05, but the second (simulated) p-value was just >.05. Although the latter p-value is more accurate, I would not rush to conclude that the data did come from a Poisson distribution. Here are some facts to bear in mind: