I need direction for a robust approach in R for deriving probability distributions for selected time points in event count time series.
In the below illustration:
- "Period_1" shows elapsed number of months,
- "1stStateX" shows the number of times elements in the population reach a state of X (it's
binary, either an element reaches X or not and it's a "dead state" in that an element can only hit X once), - "cumStateX" runs a running cumulative total of 1stStateX,
- "totalUnits" is the number of units in the population,
- "rateCumX" is the mean (or cumStateX/totalUnits), and then
- variance, standard deviation, and standard error; I show my calculation for these at the bottom.
For example, I highlight below period 18 where the cumulative mean rate of reaching X is 47.89%. I would like to derive a probability distribution around that 47.89% in period 18. Perhaps a Poisson Autoregressive Model or Autoregressive Conditional Poisson Model is the way to go for forecasting (I've been fiddling around with R packages ACP
and tscount
for forecasting), but at this point I'm more interested in simply drawing probability distributions around selected periods.
Could someone please advise on how to proceed for deriving point-in-time probability distributions, R packages that may help, etc.?
In the above, I calculate as follows using data.table
(calculations are obvious even if you don't know data.table
):
[rateCumX:= cumStateX/totalUnits]
[, var:=totalUnits * rateCumX * (1-rateCumX)]
[, sdv:=sqrt(var)]
[, serr:=sdv/sqrt(totalUnits)]
Revised standard error of the mean for binomial distribution ("serr")(1st 18 rows only of same data from original post), using data.table R package:
,rateCumX:= cumStateX/totalUnits][
,var:=totalUnits * rateCumX * (1-rateCumX)][
,sdv:=sqrt(var)][
,serr:=sdv/totalUnits]
Best Answer
From your description it sounds like a survival analysis approach would be a reasonable way to attack this problem. In particular, because you have a known number of individuals and each individual can be in exclusively one state at a time, it sounds like a multi-state survival model would work. If you're interested in reading about that I'd recommend the competing risk vignette for the R survival package. https://cran.r-project.org/web/packages/survival/vignettes/compete.pdf
For the question at hand, I'll interpret "robust" as meaning a non-parametric or assumption-free method. The Kaplan-Meier estimate is a natural generalization of what you've done so far. Here's an example of using a Kaplan-Meier estimate of the survival curve to generate confidence intervals for the proportion of individuals who will be in the "death" state at time t. From that it's easy to find confidence intervals for the number of individuals who will be in the "death" state.
Created on 2022-11-26 with reprex v2.0.2
Because the model predicts survival proportions, we need to take 1 minus these values to find the proportions in the death state.
Note that this confidence interval is not symmetric around the estimate. We can also compare with the calculation that I think you were trying to do, and see the result is quite comparable.
This isn't a probability distribution for the point-in-time estimates. The K-M approach is a non-parametric approach, so it doesn't need to assume a distribution. That's a strength and a weakness. If you have a distribution you think would be good fit, a model using that distribution will be able to make more accurate predictions and forecast into the future. The K-M estimate necessarily stops at the end of the data and can't forecast beyond the observed data.