Solved – Regression model for proportion or count when counts of outcome and total events are often zero

count-data, regression

I need help thinking about and identifying the kind of regression analysis that would be appropriate for this problem. Nothing I've discovered so far seems quite right. Referrals to articles or examples would be helpful. Thank you.

The data look like this:

  • The data are observational.

  • The sampling unit is a geographic location (EDIT: let's assume units are independent; I'm just trying to understand the basic analytical problem here).

  • At each sampling unit, there are events of two types: the event type of interest (A) and all other types (B). EDIT: In other words, each event is a binary outcome (success, failure). The outcomes are aggregated to the location level (Location 1: Success 3, Failure 2. Location 2: Success 0, Failure 1. Location 3: Success 0, Failure 0. Location 4: Success 4, Failure 9 … etc.).

  • Often, A=0 and, somewhat less often but still frequently, A+B=0.

I am interested in testing a hypothesis about A, somehow controlling for the total count of events (A+B), so either a proportion A/(A+B) or a count model that controls for the total count.

If I understand correctly, if the counts were large, proportions could be calculated for all units, and I could do beta regression. But that definitely can't happen when the total number of events for a unit is zero.

If all I cared about was the count, I could use a ZIP or other count model (and maybe still can). But the research question regards the frequency of A relative to the total number of events.

But how to control for the total number of events? Does it just go in the predictors of a ZIP or similar model? I suspect it's more complicated than that.

It seems obvious to me that the individual events could be modeled directly using multi-level logistic regression (EDIT: or another model for clustered data), but I'm wondering if there is a simpler way to examine what I'm interested in, and I just somehow haven't seen an example of this.

Best Answer

Probably the most common way to look at this kind of thing, if you're only interested in the proportions, is to assume that at the $i$th location $A_i$ & $B_i$ are independent Poisson variables with rates $\lambda_i$ & $\mu_i$ respectively. (That doesn't seem unreasonable for two types of car crashes at the same location over a limited period of time.) The joint mass function is

$$\newcommand{\e}{\mathrm{e}} f_{A_i,B_i}(a_i,b_i) = \frac{\lambda_i^{a_i} \e^{-\lambda_i}}{a_i!} \cdot \frac{\mu_i^{b_i} \e^{-\mu_i}}{b_i!}$$

Reparametrize with

$$\pi_i = \frac{\lambda_i}{\lambda_i+\mu_i}, \qquad \nu_i = \lambda_i+\mu_i,$$

so that $\lambda_i = \pi_i\nu_i$ & $\mu_i = (1-\pi_i)\nu_i$, let

$$N_i = A_i+B_i,$$

& the joint mass function can be written as

$$f_{A_i,N_i}(a_i,n_i)=\frac{1}{a_i!(n_i-a_i)!}\cdot\pi_i^{a_i} (1-\pi_i)^{n_i-a_i}\cdot \nu_i^{n_i} \e^{-\nu_i}$$

Note that $\pi_i$, what you're interested in, & $\nu_i$, the nuisance parameter, separate cleanly; $N_i$ is sufficient for $\nu_i$, & $(A_i,N_i)$ sufficient for $\pi_i$. Sum over $a_i$ to get the marginal distribution of $N_i$, which is also Poisson, with rate $\nu_i$:

$$f_{N_i}(n_i)= \frac{\nu_i^{n_i} \e^{-\nu_i}}{n_i!}$$

Conditioning on the observed value of the ancillary complement $N_i=n_i$ gives

$$f_{A_i|N_i=n_i}(a_i;n_i)=\frac{n_i!}{a_i!(n_i-a_i)!}\cdot\pi_i^{a_i} (1-\pi_i)^{n_i-a_i},$$

i.e. a binomial distribution for $A_i$ successes out of $n_i$ trials.
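If you'd like to convince yourself of the conditioning step numerically, here is a minimal sketch using only the Python standard library. The rates `lam` and `mu` are made-up values chosen for illustration; the script divides the joint Poisson mass by the Poisson mass of the total and compares the result with the binomial mass at $\pi = \lambda/(\lambda+\mu)$.

```python
import math


def poisson_pmf(k, rate):
    """Poisson probability mass at k for the given rate."""
    return rate ** k * math.exp(-rate) / math.factorial(k)


def binomial_pmf(a, n, p):
    """Binomial probability of a successes in n trials."""
    return math.comb(n, a) * p ** a * (1 - p) ** (n - a)


# Hypothetical rates for one location (illustrative values only).
lam, mu = 0.7, 1.9
nu = lam + mu      # rate of the total count N = A + B
pi = lam / nu      # proportion of type-A events

max_gap = 0.0
for n in range(0, 8):
    for a in range(0, n + 1):
        # Joint mass of (A = a, B = n - a) under independent Poissons.
        joint = poisson_pmf(a, lam) * poisson_pmf(n - a, mu)
        # Condition on N = n by dividing by the marginal mass of N.
        conditional = joint / poisson_pmf(n, nu)
        max_gap = max(max_gap, abs(conditional - binomial_pmf(a, n, pi)))

# Largest absolute difference; should be at floating-point rounding level.
print(max_gap)
```

Note that the $n=0$ case is included: the conditional distribution there puts all its mass on $a=0$ whatever $\pi$ is, which is why such locations carry no information about the proportion.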

I'm not sure what your concern is about locations where there are no events: there's simply no data at these to estimate the proportion of type-A crashes because there weren't any crashes. That doesn't stop you estimating $\pi_i$ at other locations. If location is the only predictor you have a simple $2\times k$ contingency table for the $k$ locations with data. If there are continuous predictors you can use a logistic regression model.

If you want to make estimates for the $n=0$ locations you need in some way to borrow information from other locations: e.g. by using predictors whose coefficients are estimated from other locations, or by treating location as a random effect. A Bayesian multi-level model might be quite useful, as some locations will have small, though non-zero, event counts, & estimates for these will be pulled further in the direction of the global model.
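To make the logistic-regression point concrete, here is a rough sketch of a binomial logit fit on aggregated counts, written with the standard library only so that nothing is hidden. Everything in `rows` is illustrative: the first four count pairs echo the example in the question, while the predictor `x` and the last two rows are invented. The helper `fit_binomial_logit` is a hypothetical name for a plain Newton-Raphson fit with one predictor; in practice you would use the binomial GLM routine of your statistics package.

```python
import math


def fit_binomial_logit(rows, iterations=50):
    """Fit logit(pi_i) = b0 + b1 * x_i by Newton-Raphson.

    rows holds (x, a, n) tuples: a type-A events out of n total events.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0              # score (gradient of the log-likelihood)
        h00 = h01 = h11 = 0.0      # information matrix entries (X'WX)
        for x, a, n in rows:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = a - n * p            # exactly zero when n == 0
            weight = n * p * (1.0 - p)   # exactly zero when n == 0
            g0 += resid
            g1 += resid * x
            h00 += weight
            h01 += weight * x
            h11 += weight * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X'WX) d = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    return b0, b1


# (x, A, N) per location; x is an invented predictor, counts are illustrative.
rows = [
    (0.0, 3, 5),
    (1.0, 0, 1),
    (2.0, 0, 0),   # no events at all at this location
    (3.0, 4, 13),
    (1.5, 2, 5),
    (2.5, 1, 5),
]

fit_all = fit_binomial_logit(rows)
fit_nonzero = fit_binomial_logit([r for r in rows if r[2] > 0])

# Prediction for the n = 0 location, borrowed from the other locations.
b0, b1 = fit_all
pi_hat_empty = 1.0 / (1.0 + math.exp(-(b0 + b1 * 2.0)))

print(fit_all, fit_nonzero, pi_hat_empty)
```

The two fits coincide because a location with $n_i=0$ adds nothing to the score or the information, so it can be left in or dropped without changing the estimates. The fitted curve still gives an estimate of $\pi$ at that location's predictor value, which is the "borrowing" described above.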