Solved – Dealing with zeros in a Poisson regression

missing data, poisson distribution, quality control, regression

Our code goes through multiple stages of review. I wish to use the number of defects at an earlier stage of review as a "defect density" estimate for later stages.

It sometimes happens that code has zero defects in the early stage of review. This causes trouble because if $\lambda = 0$ then $P(k)=\frac{e^{-\lambda t}(\lambda t)^k}{k!}=0$ for all $k \geq 1$, so the model says that no defects can ever be found later.

R does indeed just throw an error in this case:

foo = 0:10
bar = 2 * foo
glm(bar ~ log(foo), family = poisson)
# fails because log(0) = -Inf
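To make the source of the failure explicit, here is a minimal sketch using the same `foo` and `bar` (the name `fit` is only illustrative). It suggests the problem lies in the predictor `log(foo)`, not in the Poisson likelihood: a zero response is fine, and R treats a rate of exactly zero as a point mass at zero.

```r
foo <- 0:10
bar <- 2 * foo

# The first predictor value is log(0), which is not finite
log(foo)[1]  # -Inf

# Capture the failure instead of stopping the script
fit <- tryCatch(
  glm(bar ~ log(foo), family = poisson),
  error = function(e) conditionMessage(e)
)
is.character(fit)  # TRUE when glm() raised an error

# A rate of exactly zero is a degenerate Poisson: all mass at k = 0
dpois(0:3, lambda = 0)  # 1 0 0 0
```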

I could get around this in several ways:

  • Ignore places with zeros (this would drop 1,486 of my 4,476 data points)
  • Replace all the zeros with some number
  • Add one to everything
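For concreteness, the three workarounds listed above could be sketched on the toy data as follows. The object names and the replacement value `0.5` are arbitrary assumptions for illustration, not recommendations.

```r
foo <- 0:10
bar <- 2 * foo

# 1. Ignore the zeros: fit only on rows where the predictor is positive
fit_drop <- glm(bar ~ log(foo), family = poisson, subset = foo > 0)

# 2. Replace the zeros with some number (0.5 is an arbitrary choice here)
foo_adj <- ifelse(foo == 0, 0.5, foo)
fit_replace <- glm(bar ~ log(foo_adj), family = poisson)

# 3. Add one to everything: log1p(foo) is log(foo + 1)
fit_plus1 <- glm(bar ~ log1p(foo), family = poisson)

# bar = 2 * foo exactly, so the fit without zeros recovers a slope of 1;
# the other two fits distort the predictor and give different coefficients
rbind(drop = coef(fit_drop),
      replace = coef(fit_replace),
      plus1 = coef(fit_plus1))
```

All three run, but each changes the question being asked: the first discards data, and the other two make the fitted coefficients depend on an arbitrary constant.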

What's the best way to handle this?

Best Answer

Either use Bayesian methods (as mentioned in a comment by @Glen_b) or some form of borrowing strength, that is, analyzing multiple data sets with a common model that shares some parameters across them (this can be seen as a form of empirical Bayes). You say:

"It sometimes happens that code has zero defects in the early stage of review"

The estimated (MLE) rate of zero leads to a prediction of zero future errors, though in some models one could still construct prediction intervals (of the form $[0, \cdot)$) for the number of future errors. But borrowing strength by modeling multiple datasets together seems better to me. This is now a developed field; see https://en.wikipedia.org/wiki/List_of_software_reliability_models and this Google Scholar search.
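One way borrowing strength could look in practice is a gamma-Poisson empirical Bayes sketch in base R. Everything here is an assumption for illustration: the data are made up, the gamma prior on the per-unit defect rate is one modeling choice among many, and names such as `defects`, `exposure`, `a_hat` and `b_hat` are hypothetical.

```r
# Hypothetical data: early-stage defect counts for 8 code units and the
# amount of code reviewed for each (exposure, e.g. in KLOC)
defects  <- c(0, 5, 1, 0, 12, 2, 0, 9)
exposure <- c(1.2, 2.0, 0.8, 0.5, 3.1, 1.5, 0.9, 2.4)

# Assumed model:
#   defects[i] ~ Poisson(lambda[i] * exposure[i])
#   lambda[i]  ~ Gamma(shape = a, rate = b), shared across all units
# Marginally, defects[i] is negative binomial with
#   size = a and prob = b / (b + exposure[i])
neg_loglik <- function(par) {
  a <- exp(par[1])  # optimise on the log scale so that a, b stay positive
  b <- exp(par[2])
  -sum(dnbinom(defects, size = a, prob = b / (b + exposure), log = TRUE))
}
fit <- optim(c(0, 0), neg_loglik)
a_hat <- exp(fit$par[1])
b_hat <- exp(fit$par[2])

# Per-unit MLE (zero for units with no defects) versus the posterior mean
# under the fitted prior, which is pulled toward the pooled rate
mle_rate <- defects / exposure
eb_rate  <- (a_hat + defects) / (b_hat + exposure)

# Posterior predictive for a later stage covering the same exposure is
# negative binomial; its 95% quantile gives an interval [0, upper]
upper <- qnbinom(0.95, size = a_hat + defects,
                 prob = (b_hat + exposure) / (b_hat + 2 * exposure))

data.frame(defects, exposure, mle_rate, eb_rate, upper)
```

Units with zero early-stage defects get a small but strictly positive estimated rate, so later-stage predictions no longer collapse to zero.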
