Solved – GLM – which probability distribution to use for abundance data

aic, ecology, gamma distribution, generalized linear model, poisson distribution

I'm fitting a generalized linear model to try to understand how the abundance of a species of freshwater fish varies in response to some environmental variables. I'm using the AIC to choose between models. My main question is which family of probability distribution to use, Poisson or Gamma?

When I use Poisson, I can't get an AIC value for the null model. The message that appears is: AIC: Inf.

The summary output is this:

Call:
glm(formula = Lampetra ~ 1, family = poisson(link = log), data = cont)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3833  -1.0154  -0.3811   0.1948   2.4742  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.08295    0.22009   0.377    0.706

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 22.079  on 18  degrees of freedom
Residual deviance: 22.079  on 18  degrees of freedom
AIC: Inf

Number of Fisher Scoring iterations: 5

Best Answer

The Gamma distribution has support on the positive real numbers, i.e. it is for continuous data strictly between 0 and $+\infty$. The Poisson has support on the non-negative integers. We normally record abundance as numbers of individuals, a count, and therefore a discrete distribution like the Poisson would be a reasonable starting point for modelling. The Gamma would be unsuitable as it is for continuous data, and we don't usually see 2.4 fish.

The infinite AIC suggests there may have been a problem with the fitting of the model or perhaps your data are not Poisson distributed (conditional upon the values of $X$). It's difficult to diagnose potential problems with so little information to go on.
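One possibility worth ruling out, offered as an assumption rather than a diagnosis: in R, a Poisson GLM fitted to a response that contains non-integer values still converges, but the log-likelihood is evaluated with `dpois()`, which assigns zero probability to non-integers, so the reported AIC is `Inf`. The summary in the question is at least consistent with this, since an intercept-only fit with mean $e^{0.08295} \approx 1.086$ over 19 observations implies a total of about 20.6, which is not a whole number. The sketch below uses made-up values (`abund` is hypothetical, not the asker's data) to reproduce the symptom and to check whether a response consists of whole numbers.

```r
# Hypothetical abundance values (e.g. densities or catch per unit effort),
# deliberately including non-integers. These are NOT the asker's data.
abund <- c(0, 0.5, 1.2, 2, 0, 1.7, 3.4, 0.8)

# TRUE only when every value is a whole number, i.e. a genuine count
is_count <- all(abund == round(abund))
print(is_count)

# The fit converges; R warns about the non-integer values (suppressed here)
fit <- suppressWarnings(glm(abund ~ 1, family = poisson(link = "log")))

# dpois() gives non-integers zero probability, so the log-likelihood is -Inf
print(AIC(fit))
```

If the response really is a count and `is_count` is `TRUE`, the infinite AIC needs a different explanation.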

Related Solutions

Solved – Very large theta values using glm.nb in R – alternative approaches

It doesn't necessarily mean that there is overdispersion (though it could), just that a saturated model may be a better fit. If you only have 7-9 observations, it will be very difficult to accurately test for overdispersion unless you have some values that are just way out there under a Poisson assumption.
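A common rough check, sketched here with simulated data rather than data from the question, is to compare the Pearson chi-square statistic with the residual degrees of freedom; a ratio well above 1 hints at overdispersion. The variable names, coefficients, and the sample size of 9 are illustrative assumptions, and with this few observations the ratio is very noisy.

```r
set.seed(1)                       # reproducible illustration
x <- 1:9                          # hypothetical predictor, 9 observations
y <- rpois(length(x), lambda = exp(0.2 + 0.15 * x))  # simulated Poisson counts

fit <- glm(y ~ x, family = poisson(link = "log"))

# Pearson chi-square divided by the residual degrees of freedom
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)
```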

Another option you might look into is using the Poisson model but using a transformed value of your predictor rather than a linear fit on the raw variable. If it looks like the larger values of the predictor are where the Y-values are off more, you could try using something like a squared value of the predictor, or if it's the opposite then maybe a log-transform of the predictor.
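As a minimal sketch of that idea, assuming a single positive predictor `x` and simulated counts `y` (both hypothetical), the raw, log-transformed, and squared versions of the predictor can be fitted as separate Poisson models and compared by AIC:

```r
set.seed(2)                                      # reproducible illustration
x <- seq(1, 20, length.out = 9)                  # hypothetical positive predictor
y <- rpois(length(x), lambda = 2 + 1.5 * log(x)) # simulated counts

fits <- list(
  raw     = glm(y ~ x,      family = poisson),
  log     = glm(y ~ log(x), family = poisson),
  squared = glm(y ~ I(x^2), family = poisson)
)

# Same response and same number of parameters, so the AICs are comparable
aics <- sapply(fits, AIC)
print(aics)
```

With only a handful of observations the AIC differences may be small, so a plot of the residuals against the predictor is a useful companion to this comparison.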

Thinking about overdispersion in a count model is always a good idea, but it does introduce complexity into the model. With so few data points, your best approach might be to keep it as simple as possible.

Solved – Which distribution to use for a probability problem

The following analysis illustrates one approach to obtaining a solution. At least it might help show how to work with the Poisson distribution.

To answer this question constructively and clearly, let's make a few simplifying assumptions to avoid getting bogged down in details that haven't been described. For instance, you might choose to

assume that a "breakdown" is an event with such a short duration that a machine is back in operation immediately after a breakdown; and
therefore the same machine could break down multiple times during a week (although this might be a rare event).

As apparently intended by the question, we will make some additional stronger assumptions. Some such (modeling) assumptions are needed to make any progress at all with the answer. Their chief purpose would be to give us a point of departure for eliciting additional information from the plant engineers so we could develop improved models and better answers:

All machines independently have the same chances of breaking down and
those chances do not vary over time.

These assumptions imply the number of breakdowns observed among any number $N$ of machines during any period of $x$ weeks has a Poisson$(\lambda N x)$ distribution, where $\lambda$ is a parameter common to all machines at all times. The question tells us about the breakdown rate for $x = 1$ week:

$$\lambda \times N \times (1\text{ week}) = 2.$$

Therefore

$$\lambda = 2 / (N \text{ machine-weeks}).$$

In a random sample of $26$ such machines, the number of breakdowns in a week will have a Poisson distribution with parameter

$$\mu = \lambda\times (26\text{ machines})\times (1\text{ week}) = 26\lambda = 52/N.$$

From the formula for Poisson probabilities, the chance of no breakdowns among these $26$ machines is

$$e^{-\mu}\frac{\mu^0}{0!} = e^{-\mu} = e^{-26\lambda} = e^{-52/N}.$$

Since $N\ge 26$, this value cannot be less than $e^{-52/26}=e^{-2}\approx 0.135$, and as $N$ grows large it approaches $1$.
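To see how the result depends on $N$, the expression $e^{-52/N}$ can be evaluated for a few totals. The values of `N` below are hypothetical choices for illustration; only the sample size of 26 and the weekly mean of 2 come from the problem.

```r
N <- c(26, 60, 100, 1000)    # hypothetical totals of machines, each at least 26
mu <- 52 / N                 # expected weekly breakdowns among the 26 sampled machines
p0 <- dpois(0, lambda = mu)  # P(no breakdowns) = exp(-mu)

print(data.frame(N = N, mu = mu, p0 = p0))
```

The probability is smallest at $N = 26$, where it equals $e^{-2} \approx 0.135$, and it rises toward 1 as $N$ increases.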

This is not a final answer. It only shows the implication of four assumptions that were made upon interpreting the question in terms of the chance of no breakdowns in a week. (Other interpretations of the question are possible, due to the contorted syntax used to pose it.) In particular, the dependence upon the unknown total number of machines is clear and explicit. This is about as far as one can go, given the limited information supplied in the question.

A simulation (covering almost 200 years of operation) illustrates the ideas. Its output consists of two histograms: the weekly breakdown counts for all $N$ machines and the counts for the sample of the machines. Here is an example for $N=60$:

On each histogram are drawn two vertical lines: a gray one indicating the location of the actual rate (as given by the preceding solution) and a dashed red one indicating the average rate during the simulation. In each case those lines are visibly coincident, showing that the simulation and the preceding analysis are in agreement.

Studying the R code that produced this simulation may help clarify the ideas.

n <- 60           # Number of machines
sample.size <- 26 # Must be less than or equal to n
weekly.mean <- 2  # Events per week, on average
n.iter <- 1e4     # Size of this simulation in weeks
set.seed(17)      # Reproduce these results exactly
#
# Simulate all machines.
#
lambda <- weekly.mean/n                          # Weekly breakdown rate per machine
x <- matrix(rpois(n.iter*n, lambda), nrow=n)     # Breakdowns by machine by week
weekly.breakdowns <- colSums(x)                  # Total breakdowns each week
sample.breakdowns <- colSums(x[1:sample.size, ]) # Total breakdowns in the sample
#
# Plot the results.
#
par(mfrow=c(1,2))
eps <- 0.99
hist(weekly.breakdowns, breaks=(-1):max(weekly.breakdowns)+eps,
     freq=FALSE, cex.main=0.9)
abline(v=lambda * n, lwd=2, col="Gray")
abline(v=mean(weekly.breakdowns), col="Red", lwd=3, lty=3)

mu <- weekly.mean * sample.size / n
hist(sample.breakdowns, breaks=(-1):max(sample.breakdowns)+eps,
     freq=FALSE, cex.main=0.9)
abline(v=mu, lwd=2, col="Gray")
abline(v=mean(sample.breakdowns), col="Red", lwd=3, lty=3)
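The same simulation can also be used to check the probability of no breakdowns directly. The following self-contained sketch repeats the simulation settings and compares the share of simulated weeks with zero breakdowns in the sample against $e^{-52/N}$ for $N = 60$; the names `simulated` and `predicted` are introduced here for illustration.

```r
set.seed(17)
n <- 60; sample.size <- 26; weekly.mean <- 2; n.iter <- 1e4  # same settings as the simulation
lambda <- weekly.mean / n                          # weekly breakdown rate per machine
x <- matrix(rpois(n.iter * n, lambda), nrow = n)   # breakdowns by machine by week
sample.breakdowns <- colSums(x[1:sample.size, ])   # weekly totals in the sample

simulated <- mean(sample.breakdowns == 0)          # share of weeks with no breakdowns
predicted <- exp(-weekly.mean * sample.size / n)   # e^{-52/N} with N = 60
print(c(simulated = simulated, predicted = predicted))
```

The two numbers should agree up to simulation error, which is on the order of 0.005 for 10,000 simulated weeks.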
