R – Confidence Interval for Incidence Rate with Multiple Events per Person

Tags: confidence-interval, r

I assume that most sources about incidence rates assume there is only one event per person. In a medical context (e.g. cancer) this makes sense.

I am able to compute the incidence rate and its CI if there are fewer events than people, or the same number. See the example code below. But when there are more events than people in the population, there are errors, and I don't know why.

My hypotheses so far:

  1. I did something technically wrong.
  2. My method is statistically wrong: to compute the CI of an incidence rate when multiple events per person are possible, you need another approach.
  3. Statistically, a CI makes no sense when multiple events per person are possible.

I can compute the incidence rate and its confidence interval with this R function:

incidence_rate_with_CI <- function(observed_n, population_n, rate_relation, CI = 95) {
    # Incidence rate per `rate_relation` people
    ir <- observed_n / population_n * rate_relation
    # Variance of the raw incidence rate (binomial/Wald approximation)
    ir_variance <- ir * (rate_relation - ir) / population_n

    # Normal quantile for a two-sided CI at the given level
    nd <- qnorm((1 + CI / 100) / 2, mean = 0, sd = 1)

    # Confidence interval
    ci_lower <- ir - nd * sqrt(ir_variance)
    ci_upper <- ir + nd * sqrt(ir_variance)

    return(list(ir = ir, lower = ci_lower, upper = ci_upper))
}

An example calculation with half as many events as people in the population:

> population = 1000000
> events = population / 2
> incidence_rate_with_CI(events, population, 1000, 95)
$ir
[1] 500

$lower
[1] 499.02

$upper
[1] 500.98

But with more events than people, it throws errors:

> population = 1000000
> events = population + 1
> incidence_rate_with_CI(events, population, 1000, 95)
$ir
[1] 1000.001

$lower
[1] NaN

$upper
[1] NaN

Warning messages:
1: In sqrt(ir_variance) : NaNs produced
2: In sqrt(ir_variance) : NaNs produced
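
Tracing it by hand, the NaNs seem to come from the variance term: plugging in the numbers from the failing example suggests that ir * (rate_relation - ir) / population_n turns negative as soon as ir exceeds rate_relation (a minimal check):

# Minimal check with the numbers from the failing example above:
# the variance term goes negative once ir > rate_relation, and
# sqrt() of a negative number is NaN in R.
ir <- (1000000 + 1) / 1000000 * 1000   # 1000.001
ir * (1000 - ir) / 1000000             # -1.000001e-06 (negative)
sqrt(ir * (1000 - ir) / 1000000)       # NaN, with a warning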

Sidenote: There is no tag for incidence rate, only for incidence rate ratio. Is the latter a synonym for the former, or is it something different?

Best Answer

Looking at multiple events per subject is very common. E.g. methods for the negative binomial distribution (or negative binomial regression) are very popular and often better than assuming a (simpler) Poisson distribution, because rates can differ between subjects. With a log-time-at-risk offset, negative binomial regression (and similarly Poisson regression) can take into account that different subjects may be at risk for different periods of time.

In that setting, you typically work with log-rates, because confidence intervals based on normal approximations behave better on the log scale (plus it reflects that rates cannot be negative). E.g. you can use the glm.nb function from the MASS package, but be aware that for small to moderate sample sizes it gives a too-small SE (for a discussion of that and how to fix it, see here). E.g. for a single negative binomial rate, you could (ignoring the issues with SEs) get:

library(MASS)

# Intercept-only negative binomial model with a log-follow-up offset,
# fit to a small toy data set of 5 subjects
negbinfit1 <- glm.nb(data = data.frame(subject = c(1, 2, 3, 4, 5),
                                       events = c(100, 50, 150, 10, 75),
                                       log_followup = log(c(2, 1, 2, 0.5, 1))),
                     formula = events ~ 1 + offset(log_followup))

summary(negbinfit1)

# Back-transform the log-rate and its 95% Wald CI to the rate scale
exp(summary(negbinfit1)$coef[1, 1])
exp(summary(negbinfit1)$coef[1, 1] - qnorm(0.975) * summary(negbinfit1)$coef[1, 2])
exp(summary(negbinfit1)$coef[1, 1] + qnorm(0.975) * summary(negbinfit1)$coef[1, 2])
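
For comparison, a minimal sketch of the analogous Poisson fit on the same toy data (the name poisfit1 is illustrative, not from the code above); the rate and CI are extracted the same way, but as noted the Poisson SE will typically be too small when rates differ between subjects:

# Same model, but assuming a Poisson distribution instead of negative binomial
poisfit1 <- glm(data = data.frame(subject = c(1, 2, 3, 4, 5),
                                  events = c(100, 50, 150, 10, 75),
                                  log_followup = log(c(2, 1, 2, 0.5, 1))),
                formula = events ~ 1 + offset(log_followup),
                family = poisson)

summary(poisfit1)

# Back-transform the log-rate and its 95% Wald CI to the rate scale
exp(summary(poisfit1)$coef[1, 1])
exp(summary(poisfit1)$coef[1, 1] - qnorm(0.975) * summary(poisfit1)$coef[1, 2])
exp(summary(poisfit1)$coef[1, 1] + qnorm(0.975) * summary(poisfit1)$coef[1, 2])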