Solved – What to do when count data does not fit a Poisson distribution

distributionsmodelingpoisson distributionr

I'm a PhD stats student. I'm working with a data set of count data. It's counts of users who are involved in an n-way real time chat conversation. The # of users range from 1 to 6 and there are approx 300 pieces of data in the set.

My initial motivation was to understand if the data would fit a Poisson distribution, my thinking being if a good fit was found i could use this result for further inference.

To cut a long story short, I attempted to fit the data and the fit fails at a 0.05 significance level. Thus I can reject my hypothesis (that a Poisson distribution can be used to approximate the data set).

When i look at a density plot, i believe the reason why there is such a poor fit is that there are "too many recorded values for 2 users. A Poisson distribution would fit better with less values in this bin. However as i compiled the data myself I have no reason to believe there are outliers (i.e. conversations with 2 users which would be assigned to a higher or lower bin)

users <- c(1, 2, 2, 1, 1, 1, 1, 2, 2, 3, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 4, 3, 3, 3, 1,
        2, 1, 1, 2, 4, 3, 2, 2, 1, 2, 3, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 3,
        2, 1, 2, 3, 2, 1, 2, 1, 2, 1, 1, 3, 1, 1, 1, 2, 2, 2, 3, 1, 2, 1, 2, 4, 4,
        3, 2, 2, 3, 4, 3, 3, 3, 1, 2, 4, 2, 3, 3, 2, 4, 3, 1, 2, 4, 1, 2, 2, 2, 1,
        1, 1, 2, 3, 2, 4, 5, 2, 2, 4, 2, 2, 3, 3, 3, 2, 2, 3, 1, 3, 1, 1, 1, 2, 3,
        6, 3, 3, 4, 2, 2, 2, 3, 1, 1, 1, 2, 2, 3, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2,
        3, 3, 3, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 4, 3, 3, 2, 1, 2, 4, 1, 2, 1, 2, 2,
        2, 3, 2, 2, 2, 2, 2, 3, 2, 2, 1, 1, 3, 1, 2, 1, 2, 3, 4, 2, 4, 3, 2, 2, 1,
        4, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 3, 3, 1, 1, 2, 1, 2, 1, 3, 3, 3, 3,
        4, 6, 6, 5, 5, 2, 2, 3, 3, 3, 2, 3, 3, 4, 2, 3, 1, 3, 3, 1, 3, 2, 1, 3, 3,
        2, 1, 3, 1, 3, 2, 1, 1, 1, 1, 3, 1, 3, 4, 1, 4, 1, 3, 2, 3, 6, 2, 2, 3, 2,
        1, 2, 2, 2, 2, 2, 1, 2, 3, 2, 2, 4, 2, 2, 2, 3, 2, 2, 5, 3, 2, 2, 3, 2, 2,
        2, 5, 2, 1, 4, 1, 2, 2, 6, 1, 3, 2)


tu.fit <- goodfit(users,type="poisson", method = "MinChisq")
summary(tu.fit)
 Goodness-of-fit test for poisson distribution

             X^2 df     P(> X^2)
Pearson 69.37891  5 1.379945e-13

At undergrad level statistics I was taught that count data can be modeled by a Poisson distribution, but they never taught what to do when count data do not fit.

I'm not tied to the premise that my count data should fit Poisson (or any other distribution for that matter). However, I wanted to explore the efficacy of whether I should transform my dataset and fit to another discrete distribution or try some other approach (KDE) instead. Or should I simply conclude that my data is not a good fit for Poisson (or any other distribution) and leave it at that?

Best Answer

Let's consider, somewhat simplistically, the natural history of a conversation:

  1. One person initiates a conversation by sending a message out into the ether.

  2. People respond. Each new (unique) respondent adds one to the count.

  3. Responses to any message are random: whether an individual responds depends on whether

    • They are aware of the message
    • Currently have an opportunity to respond
    • Are interested in responding.
  4. Compared to the number of people who could receive messages, the number of messages initiated is relatively low. Thus

    • Almost all individuals will be responding to one, or a small manageable numbers, of messages at any time.

Characteristics (3) and (4) suggest a Poisson distribution could be a good model for the number of people responding to any message at any time: that is, the count minus one. What we do not know, and might not be safe assuming is whether all messages have approximately the same Poisson parameter or whether those parameters vary appreciably.

A good starting point, then, would be to test whether the counts minus one fit a Poisson distribution. Alternatively, they might fit some overdispersed distribution comprised of a mixture of Poissons.

The maximum likelihood estimate of the Poisson parameter $\lambda$ is the mean of the counts (minus one), equal to $1.20$. (It is important to use the ML estimate for this calculation rather than the "MinChisq" estimate computed by vcd::goodfit: see https://stats.stackexchange.com/a/17148/919.) Multiplying the Poisson probabilities by the total number of users gives the expected numbers of user counts. Here they are compared to the actual counts:

          0   1  2  3 4 5
Expected 94 113 68 27 8 2
Actual   85 127 68 22 5 5

The fit looks close. It can be measured with the chi-squared statistic, $$\chi^2 = \frac{(85-94)^2}{94} + \frac{(127-113)^2}{113} + \cdots + \frac{(5-2)^2}{2} = 9.61.$$

The six terms in this sum measure the individual count discrepancies. They are

     0    1    2    3    4    5 
  0.88 1.79 0.00 0.93 1.18 4.82  

Values close to $1$ signify good agreement. Only the last value, $4.82$, is large. That is due to the small expected value of $2$ for a count of $5$. Typically, expected values less than $5$ are believed to lead to some unreliability in the traditional $\chi^2$ test: here, we should take the $\chi^2$ statistic to perhaps be a little inflated due to the small expected number of six-way conversations.

Nevertheless, this $\chi^2$ statistic is not terribly high: under the hypothesized unvarying Poisson distribution, this statistic would approximately follow a $\chi^2(5)$ distribution. That distribution tells us a value this high occurs almost nine percent of the time. We conclude there is little evidence of a departure from a constant Poisson distribution.

Incidentally, a plot of the data--in the sequence given--does hint at a variation in the counts. On average they increase a little from beginning to end, as the Lowess smooth on this plot suggests:

Plot

Thus, the chi-squared test of Poisson distribution should not be the last word: it should only be considered the beginning of a more detailed analysis.


Here is the R code used to perform the calculations and create the figure.

counts <- table(users-1)
mu <- mean(users-1)
expected <- dpois(as.numeric(names(counts)), mu) * length(users)
x <- (counts - expected)^2 / expected
print(round(x, 2)) # Terms in the chi-squared statistic
print(rbind(Expected = round(expected, 0), Actual=counts)) # Compare expected to actual

library(ggplot2)
X <- data.frame(Index=1:length(users), Count=users)
g <- ggplot(X, aes(Index, Count)) + geom_smooth(size=2) + geom_point(size=2, alpha=1/2)
print(g)
Related Question