Solved – What to do when count data does not fit a Poisson distribution

distributionsmodelingpoisson distributionr

I'm a PhD stats student. I'm working with a data set of count data. It's counts of users who are involved in an n-way real time chat conversation. The # of users range from 1 to 6 and there are approx 300 pieces of data in the set.

My initial motivation was to understand if the data would fit a Poisson distribution, my thinking being if a good fit was found i could use this result for further inference.

To cut a long story short, I attempted to fit the data and the fit fails at a 0.05 significance level. Thus I can reject my hypothesis (that a Poisson distribution can be used to approximate the data set).

When i look at a density plot, i believe the reason why there is such a poor fit is that there are "too many recorded values for 2 users. A Poisson distribution would fit better with less values in this bin. However as i compiled the data myself I have no reason to believe there are outliers (i.e. conversations with 2 users which would be assigned to a higher or lower bin)

users <- c(1, 2, 2, 1, 1, 1, 1, 2, 2, 3, 2, 2, 2, 1, 1, 1, 2, 2, 1, 1, 4, 3, 3, 3, 1,
        2, 1, 1, 2, 4, 3, 2, 2, 1, 2, 3, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 3,
        2, 1, 2, 3, 2, 1, 2, 1, 2, 1, 1, 3, 1, 1, 1, 2, 2, 2, 3, 1, 2, 1, 2, 4, 4,
        3, 2, 2, 3, 4, 3, 3, 3, 1, 2, 4, 2, 3, 3, 2, 4, 3, 1, 2, 4, 1, 2, 2, 2, 1,
        1, 1, 2, 3, 2, 4, 5, 2, 2, 4, 2, 2, 3, 3, 3, 2, 2, 3, 1, 3, 1, 1, 1, 2, 3,
        6, 3, 3, 4, 2, 2, 2, 3, 1, 1, 1, 2, 2, 3, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2,
        3, 3, 3, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 4, 3, 3, 2, 1, 2, 4, 1, 2, 1, 2, 2,
        2, 3, 2, 2, 2, 2, 2, 3, 2, 2, 1, 1, 3, 1, 2, 1, 2, 3, 4, 2, 4, 3, 2, 2, 1,
        4, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 3, 3, 1, 1, 2, 1, 2, 1, 3, 3, 3, 3,
        4, 6, 6, 5, 5, 2, 2, 3, 3, 3, 2, 3, 3, 4, 2, 3, 1, 3, 3, 1, 3, 2, 1, 3, 3,
        2, 1, 3, 1, 3, 2, 1, 1, 1, 1, 3, 1, 3, 4, 1, 4, 1, 3, 2, 3, 6, 2, 2, 3, 2,
        1, 2, 2, 2, 2, 2, 1, 2, 3, 2, 2, 4, 2, 2, 2, 3, 2, 2, 5, 3, 2, 2, 3, 2, 2,
        2, 5, 2, 1, 4, 1, 2, 2, 6, 1, 3, 2)


tu.fit <- goodfit(users,type="poisson", method = "MinChisq")
summary(tu.fit)
 Goodness-of-fit test for poisson distribution

             X^2 df     P(> X^2)
Pearson 69.37891  5 1.379945e-13

At undergrad level statistics I was taught that count data can be modeled by a Poisson distribution, but they never taught what to do when count data do not fit.

I'm not tied to the premise that my count data should fit Poisson (or any other distribution for that matter). However, I wanted to explore the efficacy of whether I should transform my dataset and fit to another discrete distribution or try some other approach (KDE) instead. Or should I simply conclude that my data is not a good fit for Poisson (or any other distribution) and leave it at that?

Best Answer

Let's consider, somewhat simplistically, the natural history of a conversation:

One person initiates a conversation by sending a message out into the ether.
People respond. Each new (unique) respondent adds one to the count.
Responses to any message are random: whether an individual responds depends on whether
- They are aware of the message
- Currently have an opportunity to respond
- Are interested in responding.
Compared to the number of people who could receive messages, the number of messages initiated is relatively low. Thus
- Almost all individuals will be responding to one, or a small manageable numbers, of messages at any time.

Characteristics (3) and (4) suggest a Poisson distribution could be a good model for the number of people responding to any message at any time: that is, the count minus one. What we do not know, and might not be safe assuming is whether all messages have approximately the same Poisson parameter or whether those parameters vary appreciably.

A good starting point, then, would be to test whether the counts minus one fit a Poisson distribution. Alternatively, they might fit some overdispersed distribution comprised of a mixture of Poissons.

The maximum likelihood estimate of the Poisson parameter $\lambda$ is the mean of the counts (minus one), equal to $1.20$. (It is important to use the ML estimate for this calculation rather than the "MinChisq" estimate computed by vcd::goodfit: see https://stats.stackexchange.com/a/17148/919.) Multiplying the Poisson probabilities by the total number of users gives the expected numbers of user counts. Here they are compared to the actual counts:

          0   1  2  3 4 5
Expected 94 113 68 27 8 2
Actual   85 127 68 22 5 5

The fit looks close. It can be measured with the chi-squared statistic, $$\chi^2 = \frac{(85-94)^2}{94} + \frac{(127-113)^2}{113} + \cdots + \frac{(5-2)^2}{2} = 9.61.$$

The six terms in this sum measure the individual count discrepancies. They are

     0    1    2    3    4    5 
  0.88 1.79 0.00 0.93 1.18 4.82

Values close to $1$ signify good agreement. Only the last value, $4.82$, is large. That is due to the small expected value of $2$ for a count of $5$. Typically, expected values less than $5$ are believed to lead to some unreliability in the traditional $\chi^2$ test: here, we should take the $\chi^2$ statistic to perhaps be a little inflated due to the small expected number of six-way conversations.

Nevertheless, this $\chi^2$ statistic is not terribly high: under the hypothesized unvarying Poisson distribution, this statistic would approximately follow a $\chi^2(5)$ distribution. That distribution tells us a value this high occurs almost nine percent of the time. We conclude there is little evidence of a departure from a constant Poisson distribution.

Incidentally, a plot of the data--in the sequence given--does hint at a variation in the counts. On average they increase a little from beginning to end, as the Lowess smooth on this plot suggests:

Thus, the chi-squared test of Poisson distribution should not be the last word: it should only be considered the beginning of a more detailed analysis.

Here is the R code used to perform the calculations and create the figure.

counts <- table(users-1)
mu <- mean(users-1)
expected <- dpois(as.numeric(names(counts)), mu) * length(users)
x <- (counts - expected)^2 / expected
print(round(x, 2)) # Terms in the chi-squared statistic
print(rbind(Expected = round(expected, 0), Actual=counts)) # Compare expected to actual

library(ggplot2)
X <- data.frame(Index=1:length(users), Count=users)
g <- ggplot(X, aes(Index, Count)) + geom_smooth(size=2) + geom_point(size=2, alpha=1/2)
print(g)

Methods of fitting discrete distributions

There are three main methods* used to fit (estimate the parameters of) discrete distributions.

1) Maximum Likelihood

This finds the parameter values that give the best chance of supplying your sample (given the other assumptions, like independence, constant parameters, etc)

2) Method of moments

This finds the parameter values that make the first few population moments match your sample moments. It’s often fairly easy to do, and in many cases yields fairly reasonable estimators. It’s also sometimes used to supply starting values to ML routines.

3) Minimum chi-square

This minimizes the chi-square goodness of fit statistic over the discrete distribution, though sometimes with larger data sets, the end-categories might be combined for convenience. It often works fairly well, and it even arguably has some advantages over ML in particular situations, but generally it must be iterated to convergence, in which case most people tend to prefer ML.

The first two methods are also used for continuous distributions; the third is usually not used in that case.

These by no means comprise an exhaustive list, and it would be quite possible to estimate parameters by minimizing the KS-statistic for example – and even (if you adjust for the discreteness), to get a joint consonance region from it, if you were so inclined. Since you’re working in R, ML estimation is quite easy to achieve for the negative binomial. If your sample were in x, it’s as simple as library(MASS);fitdistr (x,"negative binomial"):

> library(MASS) 
> x <- rnegbin(100,7,3)
> fitdistr (x,"negative binomial")
     size         mu    
  3.6200839   6.3701156 
 (0.8033929) (0.4192836)

Those are the parameter estimates and their (asymptotic) standard errors.

In the case of the Poisson distribution, MLE and MoM both estimate the Poisson parameter at the sample mean.

If you'd like to see examples, you should post some actual counts. Note that your histogram has been done with bins chosen so that the 0 and 1 categories are combined and we don't have the raw counts.

As near as I can guess, your data are roughly as follows:

    Count:  0&1   2   3   4   5   6  >6    
Frequency:  311 197  74  15   3   1   0

But the big numbers will be uncertain (it depends heavily on how accurately the low-counts are represented by the pixel-counts of their bar-heights) and it could be some multiple of those numbers, like twice those numbers (the raw counts affect the standard errors, so it matters whether they're about those values or twice as big)

The combining of the first two groups makes it a little bit awkward (it's possible to do, but less straightforward if you combine some categories. A lot of information is in those first two groups so it's best not to just let the default histogram lump them).

* Other methods of fitting discrete distributions are possible of course (one might match quantiles or minimise other goodness of fit statistics for example). The ones I mention appear to be the most common.

Best Answer

Related Solutions

Solved – Modelling a Poisson distribution with overdispersion

Solved – How to fit a discrete distribution to count data

Methods of fitting discrete distributions

1) Maximum Likelihood

2) Method of moments

3) Minimum chi-square

Related Question