Cumulative Distribution Function – Probability Calculation of Total Duration for Sequential Events

cumulative distribution functiongamma distributionsum

Be patient, I am not very skilled with cdf.

I seem to have a seemingly simple problem for which I either can't seem to find material about or simply lack the vocabulary for.

Given are N sequential events with their own unique independent cdf that describe the probability of an event taking X minutes or less.
If I want to calculate the probability that all N events can be completed sequentially in less than Y minutes, what would I need to use?

I feel like a joint cdf will quickly get out of hand given an increasing N. Is there maybe a way to combine the input parameters of my distributions in such a way that I get a new cdf that describes the sum of the distributions? If it helps, I'm using Gamma/Erlang distributions only at the moment.

Thanks a lot for your input, and sorry if this is a duplicate.

Best Answer

The question "what is the probability that all events will take less than Y minutes?" is answered by the cdf at Y of the sum of the random variables. For independent Gamma distributions with different parameters, this is addressed in a couple of papers, most prominently "The distribution of the sum of independent gamma random variables" (P. G. Moschopoulos, 1985) . (Erlang distributions are a special case of the Gamma distribution, so the results should be applicable in this case.) The paper presents steps for computing an infinite series whose sum converges to the density function of the sum of RVs; to obtain the cdf you'll probably need to integrate numerically. The paper also provides error bounds to choose the number of series elements to compute in practice.

Alternatively, I would suggest, at least at first, to use a Monte Carlo method, and simply simulate many trials and use the empiric distribution of those trials. This can be done e.g. using numpy's RNG, and might give accurate enough results.

Edited to add sample code for a simulation:

import numpy as np

rng = np.random.default_rng()

# Say we have N=100 distributions, here I'm randomizing them
N = 100
k = rng.random(N) * 100
theta = rng.random(N) * 10

# Sample M trials, in this case 100000
M = 100000
X = rng.gamma(k, theta, size=(M, N))
# Calculate total duration of sequential events for each trial
y = X.sum(axis=1)

MAX_MINUTES = 23000

# Approximate CDF of sum at MAX_MINUTES
print((y <= MAX_MINUTES).mean())

This runs in approximately 2 seconds on my machine. Try different values of M to choose a tradeoff between runtime and accuracy.