Distributions – How to Choose the Most Appropriate Distribution for Modeling

distributions, model, modeling, uniform distribution

I am given the following problem:

A student S wants to take the tram home after his lectures are
over. The tram line he usually takes leaves the university every 7.5 minutes on
average. Unfortunately, student S cannot remember
the departure times. Thus, he arrives at the tram
station at a random time every single weekday.

I am then asked which distribution is most appropriate for modeling the student's daily waiting time.

The random variable $X$ denotes the daily waiting time. I thought about a uniform distribution with $a=0$ and $b=7.5$, since the student arrives at the tram station at a random time and thus every waiting time between $0$ and $7.5$ is equally likely.

So far so good. What confuses me, though, is that the arrival of the tram is itself a random variable (the tram only arrives every 7.5 minutes on average). So I am not sure how to model this, since the upper bound $b$ of the uniform distribution would not always be exactly $7.5$ but rather a realisation of the underlying interarrival distribution.

I hope someone can explain how I can model this accurately.

Best Answer

It's perhaps a little more complex than it appears, requiring multiple steps.

First, let's define the cumulative distribution function of the train interarrival time $x$ as $F(x)$, with density $f(x)$ and mean $\int_0^{\infty} xf(x)dx = 7.5$. Now, it should be intuitively clear that given, say, two interarrival times $x_1$ and $x_2$, the probability that our hapless student arrives during the interval of length $x_1$ is proportional to $x_1$, and similarly for $x_2$. Since the density of an interval being of length $x_1$ in the first place is $f(x_1)$, we can see that:

$$p(\text{Observed interarrival time} = x) \propto xf(x)$$

Integrating $xf(x)$ to find the constant of proportionality (well, the part not already hidden in $f(x)$) gives us $7.5$, the mean interarrival time. So we have:

$$p(x) = {xf(x) \over 7.5}$$

(We will use that $7.5$ later on.) Now, if the student arrives randomly during an interval of length $x$, the arrival time is uniformly distributed over $(0,x)$, which of course implies the remaining time $t$ in the interval - i.e., the time until the next train arrives - is also distributed uniformly over $(0,x)$. Therefore, $p(t|x) = (1/x)\, 1(t<x)$, where $1(a)$ is the indicator function taking on the value $1$ if the condition $a$ is true, $0$ otherwise.
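The two ingredients so far (longer intervals are more likely to be hit, and the arrival point is uniform within the interval that is hit) can be checked with a small simulation. This is only a sketch under an assumed interarrival law: the problem does not specify one, so the Gamma distribution with shape 2 and scale 3.75 (mean 7.5) and the function name `simulate_observed_intervals` are illustrative choices, not part of the original problem. For that assumed law the length-biased density $xf(x)/7.5$ has mean $\mathbb{E}[x^2]/\mathbb{E}[x] = 11.25$.

```python
import bisect
import itertools
import random


def simulate_observed_intervals(n_intervals=200_000, n_students=50_000, seed=0):
    """Return (all gaps, gaps hit by a student, remaining-time fractions)."""
    rng = random.Random(seed)
    # Assumed interarrival law (not given in the problem):
    # Gamma(shape=2, scale=3.75), which has mean 2 * 3.75 = 7.5 minutes.
    gaps = [rng.gammavariate(2.0, 3.75) for _ in range(n_intervals)]
    # departures[i] is the time of the train that ends interval i.
    departures = list(itertools.accumulate(gaps))
    horizon = departures[-1]

    observed, fractions = [], []
    for _ in range(n_students):
        arrival = rng.uniform(0.0, horizon)  # student shows up at a random time
        # Index of the first departure at or after the arrival time.
        i = bisect.bisect_left(departures, arrival)
        observed.append(gaps[i])
        # Remaining time in the interval, as a fraction of its length.
        fractions.append((departures[i] - arrival) / gaps[i])
    return gaps, observed, fractions


if __name__ == "__main__":
    gaps, observed, fractions = simulate_observed_intervals()
    print("mean interarrival time :", sum(gaps) / len(gaps))
    print("mean observed interval :", sum(observed) / len(observed))
    print("mean remaining fraction:", sum(fractions) / len(fractions))
```

Under these assumptions the first mean should come out near 7.5, the second near 11.25 (the student tends to land in longer-than-average intervals), and the third near 0.5, consistent with a uniform position inside the interval.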

Combining the two expressions gives us:

$$p(t, x) = p(t|x)p(x) = {1 \over x}{xf(x) \over 7.5}1(t<x) = {1 \over 7.5}f(x)1(t<x)$$

Now we want to integrate out $x$ so we can get the marginal distribution of $t$. The indicator function makes it clear that the appropriate range of integration of $x$ for any given $t$ is from $t$ to $\infty$, as for $x < t$ the function being integrated will equal $0$.

$$p(t) = {1 \over 7.5}\int_t^{\infty}f(x)dx = {1 \over 7.5}(1-F(t))$$

A quick check: $\int_0^{\infty}(1-F(t))dt = \mathbb{E}[x] = 7.5$, the mean interarrival time (this is a moderately well-known relationship for nonnegative random variables), so we have ${1 \over 7.5}\int_0^{\infty}(1-F(t))dt = 1$ and we have derived a proper probability distribution. (Of course, the more general solution is $p(t) = (1-F(t))/\mathbb{E}[x]$, where $\mathbb{E}[x]$ is the mean interarrival time.)
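To make the result concrete, here are two special cases. Both interarrival laws are assumptions chosen for illustration, since the problem only gives the mean of $7.5$ minutes. If the trains run exactly every $7.5$ minutes, then $F(t) = 1(t \geq 7.5)$ and

$$p(t) = {1 \over 7.5}1(0 \leq t < 7.5)$$

which is the uniform distribution on $(0, 7.5)$ proposed in the question, with mean waiting time $3.75$. If instead the interarrival times are exponential with mean $7.5$, then $1-F(t) = e^{-t/7.5}$ and

$$p(t) = {1 \over 7.5}e^{-t/7.5}$$

so the waiting time is itself exponential with mean $7.5$, as one would expect from the memoryless property. The uniform model is therefore exact only when there is no randomness in the interarrival times.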

I have ignored the issue of what happens if the student arrives exactly when the train leaves. Does he get on the train, in which case the uniform distribution of the waiting time conditional on $x$ is over $[0, x)$, or does he have to wait for the next one, in which case it's $(0,x]$, with appropriate changes to the indicator function etc.? Fortunately, the difference amounts to a set of measure zero, i.e., the probability of that occurring equals zero, so our final result holds either way.
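As a numerical sanity check of the final result $p(t) = (1-F(t))/7.5$, the sketch below simulates waiting times and compares their empirical distribution function with the one implied by the formula. The interarrival law is again an assumption made only for illustration (Gamma with shape 2 and scale 3.75, so the mean is 7.5); for that choice $1-F(t) = e^{-t/3.75}(1 + t/3.75)$, and integrating $p(t)$ gives $P(T \leq t) = 1 - e^{-t/3.75}(1 + t/7.5)$. The names `simulate_waits` and `theoretical_cdf` are hypothetical.

```python
import bisect
import itertools
import math
import random

MEAN_GAP = 7.5            # mean interarrival time from the problem
SHAPE = 2.0               # assumed Gamma shape (not given in the problem)
SCALE = MEAN_GAP / SHAPE  # chosen so that SHAPE * SCALE = 7.5


def simulate_waits(n_intervals=200_000, n_students=50_000, seed=1):
    """Simulate waiting times of students arriving at uniformly random times."""
    rng = random.Random(seed)
    gaps = [rng.gammavariate(SHAPE, SCALE) for _ in range(n_intervals)]
    departures = list(itertools.accumulate(gaps))  # cumulative departure times
    waits = []
    for _ in range(n_students):
        arrival = rng.uniform(0.0, departures[-1])
        # The next train is the first departure at or after the arrival time.
        i = bisect.bisect_left(departures, arrival)
        waits.append(departures[i] - arrival)
    return waits


def theoretical_cdf(t):
    """P(T <= t) from integrating p(t) = (1 - F(t)) / 7.5 for the assumed Gamma law."""
    u = t / SCALE
    return 1.0 - math.exp(-u) * (1.0 + u / 2.0)


if __name__ == "__main__":
    waits = simulate_waits()
    for t in (2.0, 5.0, 10.0, 20.0):
        empirical = sum(w <= t for w in waits) / len(waits)
        print(f"t={t:5.1f}  empirical={empirical:.4f}  formula={theoretical_cdf(t):.4f}")
```

If the derivation holds, the two columns should agree up to simulation noise. Swapping in a different interarrival law only requires changing the sampling line and the closed form in `theoretical_cdf`.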
