R Programming – How to Derive Probability Distributions for Event Count Time Series

Tags: autoregressive, distributions, probability, r, time-series

I need direction on a robust approach in R for deriving probability distributions for selected time points in an event count time series.

In the illustration below:

  1. "Period_1" shows elapsed number of months,
  2. "1stStateX" shows the number of times elements in the population reach a state of X (it's
    binary, either an element reaches X or not and it's a "dead state" in that an element can only hit X once),
  3. "cumStateX" runs a running cumulative total of 1stStateX,
  4. "totalUnits" is the number of units in the population,
  5. "rateCumX" is the mean (or cumStateX/totalUnits), and then
  6. variance, standard deviation, and standard error; I show my calculation for these at the bottom.

For example, I highlight period 18 below, where the cumulative mean rate of reaching X is 47.89%. I would like to derive a probability distribution around that 47.89% in period 18. Perhaps a Poisson autoregressive model or an autoregressive conditional Poisson (ACP) model is the way to go for forecasting (I've been fiddling around with the R packages ACP and tscount; a sketch of what I mean is below), but at this point I'm more interested in simply drawing probability distributions around selected periods.
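For illustration, a minimal tscount sketch of the kind of model I mean might look like the following. The counts vector here is made up, and tsglm with past_obs and past_mean terms fits an ACP/INGARCH-style model:

library(tscount)
# made-up monthly counts of units first reaching state X
counts <- c(10, 112, 114, 69, 59, 77, 45, 42, 32, 26, 27, 21)
# ACP/INGARCH(1,1)-style model: Poisson conditional distribution with one
# lag of the observed counts and one lag of the conditional mean
fit <- tsglm(counts, model = list(past_obs = 1, past_mean = 1),
             distr = "poisson")
predict(fit, n.ahead = 6)  # point forecasts with prediction intervals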

Could someone please advise on how to proceed for deriving point-in-time probability distributions, R packages that may help, etc.?

[Image: table of the series with columns Period_1, 1stStateX, cumStateX, totalUnits, rateCumX, var, sdv, serr; period 18 highlighted]

In the above, I calculate as follows using data.table (calculations are obvious even if you don't know data.table):

dt[, rateCumX := cumStateX / totalUnits  # dt is the data.table shown above
   ][, var := totalUnits * rateCumX * (1 - rateCumX)
   ][, sdv := sqrt(var)
   ][, serr := sdv / sqrt(totalUnits)]

Revised standard error of the mean for the binomial distribution ("serr") (first 18 rows only of the same data from the original post), using the data.table R package:

dt[, rateCumX := cumStateX / totalUnits
   ][, var := totalUnits * rateCumX * (1 - rateCumX)
   ][, sdv := sqrt(var)
   ][, serr := sdv / totalUnits]  # SE of a proportion: sqrt(p * (1 - p) / n)
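As a sanity check, the revised serr at period 18 can be computed directly from the values in the table (cumStateX = 613, totalUnits = 1280); it is just the familiar standard error of a proportion:

n <- 1280              # totalUnits
p <- 613 / n           # rateCumX at period 18 (about 0.4789)
sqrt(p * (1 - p) / n)  # same as sdv/totalUnits above
#> [1] 0.01396283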

[Image: first 18 rows of the table, showing the revised serr column]

Best Answer

From your description it sounds like a survival analysis approach would be a reasonable way to attack this problem. In particular, because you have a known number of individuals and each individual can be in exactly one state at a time, it sounds like a multi-state survival model would work. If you're interested in reading about that, I'd recommend the competing risks vignette for the R survival package: https://cran.r-project.org/web/packages/survival/vignettes/compete.pdf
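For a flavor of what that looks like, here is a minimal multi-state sketch. It borrows the event_time and event_ind vectors constructed below; coding the status as a factor, with censoring as the first level, makes survfit return Aalen-Johansen state-occupancy probabilities:

library(survival)
# status as a factor (first level = censored) triggers a multi-state fit
status <- factor(event_ind, levels = 0:1, labels = c("censored", "stateX"))
ms <- survfit(Surv(event_time, status) ~ 1)
head(ms$pstate)  # probability of occupying each state at each event time

With only two states this reduces to the Kaplan-Meier fit below, but the same call generalizes to several competing absorbing states.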

For the question at hand, I'll interpret "robust" as meaning a non-parametric or assumption-free method. The Kaplan-Meier estimate is a natural generalization of what you've done so far. Here's an example of using a Kaplan-Meier estimate of the survival curve to generate confidence intervals for the proportion of individuals who will be in the "death" state at time t. From that it's easy to find confidence intervals for the number of individuals who will be in the "death" state.

library(survival)
n <- 1280  # total units in the population
# counts of units first reaching state X in each month (8 through 39);
# the n - 818 units that never reach X are censored at month 41
event_time <- c(
  rep(8, 10), rep(9, 112), rep(10, 114),
  rep(11, 69), rep(12, 59), rep(13, 77),
  rep(14, 45), rep(15, 42), rep(16, 32),
  rep(17, 26), rep(18, 27), rep(19, 21),
  rep(20, 19), rep(21, 22), rep(22, 24),
  rep(23, 8), rep(24, 10), rep(25, 16),
  rep(26, 17), rep(27, 12), rep(28, 5),
  rep(29, 5), rep(30, 4), rep(31, 4),
  rep(32, 4), rep(33, 4), rep(34, 4),
  rep(35, 8), rep(36, 4), rep(37, 3),
  rep(38, 6), rep(39, 5), rep(40, 0),
  rep(41, n - 818))
event_ind <- c(rep(1, 818), rep(0, n - 818))  # 1 = reached X, 0 = censored
mod <- survfit(Surv(event_time, event_ind) ~ 1)  # Kaplan-Meier fit
plot(mod, ylab = "surviving proportion", xlab = "time")

Created on 2022-11-26 with reprex v2.0.2

Because the model predicts survival proportions, we need to take 1 minus these values to find the proportions in the death state.

indx <- which(mod$time == 18)
# "Death" Probabilities
c(lower = 1 - mod$upper[indx], 
  est = 1 - mod$surv[indx],
  upper = 1 - mod$lower[indx])
#>     lower       est     upper 
#> 0.4508079 0.4789062 0.5055670
# Mean number of units in state 2
c(lower = (1 - mod$upper[indx])*n, 
  est = (1 - mod$surv[indx])*n,
  upper = (1 - mod$lower[indx])*n)
#>    lower      est    upper 
#> 577.0341 613.0000 647.1257

Note that this confidence interval is not symmetric around the estimate. We can also compare it with the calculation I think you were trying to do and see that the results are quite comparable.

# Binomial Variance calculation
p_est <- 613/n
var <- p_est*(1-p_est)*n
sdv <- sqrt(var)
sderr <- sdv/n
c(p_est - 1.96*sderr, p_est, p_est + 1.96*sderr)
#> 0.4515388 0.4789062 0.5062737
c((p_est - 1.96*sderr)*n, p_est*n, 
  (p_est + 1.96*sderr)*n)
#> 577.9697 613.0000 648.0303

Strictly speaking, this isn't a probability distribution for the point-in-time estimates. The K-M approach is non-parametric, so it doesn't need to assume a distribution; that's both a strength and a weakness. If you have a distribution you think would be a good fit, a model using that distribution will be able to make more accurate predictions and to forecast into the future. The K-M estimate necessarily stops at the end of the data and can't forecast beyond what was observed.
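To illustrate that trade-off, here's a minimal parametric sketch that can extrapolate beyond month 41. The Weibull choice is purely an assumption for illustration (it reuses event_time and event_ind from above), and its fit should be checked against the K-M curve before trusting any forecast:

# Weibull AFT model: survreg parameterizes log(T) = mu + sigma * W
wfit <- survreg(Surv(event_time, event_ind) ~ 1, dist = "weibull")
mu <- unname(coef(wfit))
sigma <- wfit$scale
t_new <- c(18, 41, 60)                         # month 60 is beyond the data
surv_w <- exp(-(t_new / exp(mu))^(1 / sigma))  # Weibull survival S(t)
1 - surv_w                                     # forecast proportion in state X

Comparing the implied value at month 18 with the K-M estimate above is a quick way to judge whether the distributional assumption is plausible.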
