Solved – Poisson / exponential distribution

exponential distribution, poisson distribution, self-study

Next weekend you will be participating in a 12 km cross-country race on a mountain. The average time between two successive wild animal sightings on the mountain is reported to be 5 minutes.

(a) What is the probability that you see at least one wild animal in the 11th minute of the race given that you will see 3 wild animals since the start of the race?

(b) What is the probability that it will take more than a quarter of an hour before you see a wild animal after ten minutes of running?

I have attempted (a), but I don't know whether my thinking is correct. (b), on the other hand, makes little sense to me.

My attempt:

(a) $X \sim \text{Poisson}(\frac{1}{5})$ and $Y \sim \text{Exponential}(\frac{1}{5})$

$\Pr(X>4) = \Pr(X>1)$ (I am thinking that this is some variation of the memoryless property)

$\Pr(X>1) = \Pr(Y<1)$

$\Pr(Y<1) = 1 - e^{-1/5}$

Perhaps I was a bit too vague in my attempt at (a), so here goes my attempt at (b). From my understanding of the question, it is asking for the probability that the time between events (in this case animal sightings) is more than 25 minutes, given that you have been running for ten minutes.

As far as I know, the fact that you have been running for ten minutes is irrelevant, due to the memoryless property of the exponential distribution.

So, without further ado, I present my attempt at (b).

Let $X$ be an exponentially distributed random variable with $\lambda = 1/5$.

Then
$\Pr(X > 15) = 1 - \Pr(X \le 15)$

Best Answer

You are struggling a bit to express yourself, so perhaps a nudge in that direction--in the form of both an English and mathematical explanation--will help the most. Here I will address only question (a) specifically, but in a way that readily applies to answering question (b), too.


Both questions are a bit confusing, so our first task is to adopt some reasonable interpretation and make that clear. For brevity, let's say that a wild animal sighting is a "point." My understanding of (a) is that it asks for the chance of seeing the fourth wild animal sometime during the eleventh minute (which starts exactly ten minutes after the beginning of the race and ends one minute later). Because this is the fourth, it implies you have already seen exactly three animals since the start of the race. This interpretation does not rule out seeing even a fifth, sixth, etc. animal during the eleventh minute, either: they are all consistent with seeing "at least one" more animal after the first three.

What I just described is called an "event." Events have probabilities. The standard notation is to use a capital letter (or something similar) to name events, such as $A$, and to let something like $\Pr(A)$ denote the probability of $A$.

To find this chance, we need to adopt a probability model for the events. The question suggests that we treat the points as a Poisson process. The salient properties of this process (in the present context) are that

  1. It provides probabilities for events of the form "$k$ points occur during a predefined time interval $I$" where $k$ is a natural number $0, 1, \ldots$ and $I$ is some predetermined set of times during the race itself.

  2. Events concerning intervals with no or zero time overlap are independent. This means we can compute their probabilities by multiplying the probabilities of the separate time intervals. (This is the "memoryless" property.)

  3. For simulating the race, it is convenient to know that the times elapsed between successive events have an Exponential distribution and that the elapsed times between pairs of successive events that do not overlap are independent.
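As a small illustration of properties (1) and (3), here is a minimal R sketch. The helper names `p_count` and `p_wait_exceeds` are made up for this example, and the rate of $1/5$ per minute comes from the problem statement. It shows that "no points in a 15-minute interval" and "the wait for the next point exceeds 15 minutes" are the same event under either description.

```r
rate <- 1/5                      # sightings per minute (from the problem statement)

# Property 1: chance of exactly k points in an interval of the given length.
# (p_count is a hypothetical helper name, not part of any package.)
p_count <- function(k, minutes) dpois(k, lambda = rate * minutes)

# Property 3: chance that the wait for the next point exceeds t minutes.
# (p_wait_exceeds is likewise a hypothetical helper name.)
p_wait_exceeds <- function(t) pexp(t, rate = rate, lower.tail = FALSE)

# "No points in 15 minutes" is the same event as
# "the next point takes more than 15 minutes to arrive".
no_points_15 <- p_count(0, 15)
wait_over_15 <- p_wait_exceeds(15)
print(c(no_points_15, wait_over_15))   # both equal exp(-3)
```

If you read question (b) as asking about a 15-minute stretch with no sightings, this is the kind of quantity it calls for, but check that against your own interpretation.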

We might also care to exploit basic properties of probability. One property that is frequently useful is its "additivity": to find the probability of disjoint events, just add the probabilities of those events. This easily implies another property: the probability that an event does not occur is $1$ minus the probability that it does occur.

With these tools, how do we solve problem (a)? There usually is more than one way. One method that is usually worth trying is to break any event which is somewhat complex (like this one) into simpler disjoint events, with an aim to exploiting the additivity property. For instance, we can view it like this:

The fourth point occurring during the eleventh minute (event $A$) can happen in precisely one of the following ways:

  • No points happen during the first ten minutes and at least four points happen in the eleventh minute, or else

  • One point happens during the first ten minutes and at least three points happen in the eleventh minute, or else

  • Two points happen during the first ten minutes and at least two points happen in the eleventh minute, or else

  • Three points happen during the first ten minutes and at least one point happens in the eleventh minute.

(If you interpret the question differently, it's likely a similar approach will still solve the problem for you.)

Because these events are disjoint--no two of them can both happen during the race--additivity implies we only need to compute the four probabilities and sum them. Each of these four component probabilities concerns two events, one covering the first ten minutes and the other covering the eleventh minute. These are all pairs of independent events, because there is zero overlap in their time spans in each case. Property (2) applies, allowing us to compute each of these four terms by multiplying the probabilities of its two events.
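One way to convince yourself that the four cases really do cover event $A$ is to simulate the two counts directly. The sketch below is only an illustration: the seed, the number of races, and the variable names are arbitrary choices. It draws the number of points in the first ten minutes and in the eleventh minute as independent Poisson counts, then checks which case each race falls into.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
n_races <- 10^5                    # illustrative number of simulated races

first_ten <- rpois(n_races, lambda = 10 / 5)  # points in minutes 0 to 10
eleventh  <- rpois(n_races, lambda = 1 / 5)   # points in the eleventh minute, drawn independently

# Event A: at most three points in the first ten minutes, and the
# running total reaches at least four during the eleventh minute.
in_A  <- first_ten <= 3 & first_ten + eleventh >= 4
est_A <- mean(in_A)

# Each race in A belongs to exactly one of the four cases (k = 0, 1, 2, 3),
# identified by the count in the first ten minutes.
cases <- table(first_ten[in_A])

print(est_A)
print(cases)
```

The case with no points in the first ten minutes is so rare that it may not show up at all in a run of this size.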

It may help to see the same argument expressed in mathematical notation. For any given natural number $k$ (counting the points) and time interval from $a$ minutes to $b$ minutes after the start of the race, let $S(k, a, b)$ be the event where exactly $k$ points occur in the interval $[a, b)$. For instance, $S(4, 0, 10)$ describes exactly four wild animal sightings during the first ten minutes. Similarly, let $T(k, a, b)$ be the event where at least $k$ points occur in the interval $[a, b)$. The foregoing argument asserts that

$$\eqalign{ A = &\left(S(0, 0, 10) \text{ and } T(4, 10, 11)\right) \text{ or } \left(S(1, 0, 10) \text{ and } T(3, 10, 11)\right) \\ &\text{ or } \left(S(2, 0, 10) \text{ and } T(2, 10, 11)\right) \text{ or } \left(S(3, 0, 10) \text{ and } T(1, 10, 11)\right). }$$

To compute its probability we need to compute probabilities of events of the form $S(k, 0, 10)$ and $T(4-k, 10, 11)$, multiply them, and add these up as $k$ ranges from $0$ up to $3$:

$$\Pr(A) = \sum_{k=0}^3 \Pr(S(k,0,10)) \Pr(T(4-k,10,11)).$$

That is what the previous English description was trying to say--and when you read formulas like this, you should be translating them (in your head) back into similar full English sentences.

The rest is routine: $\Pr(S(k, 0, 10))$ is obtained from the Poisson distribution. Because it covers $10-0=10$ minutes and the rate of wild animal sightings is one per five minutes = $1/5$ per minute, the Poisson parameter must be $10\times 1/5=2$ for this calculation. $\Pr(T(j, 10, 11))$ is also obtained from the Poisson distribution (but with a parameter of $1/5$ because it concerns an interval of only one minute, not ten). It is a tail probability:

$$\Pr(T(j,10,11)) = \Pr(S(j,10,11)) + \Pr(S(j+1,10,11)) + \cdots + \Pr(S(j+n,10,11)) + \cdots$$

because "$j$ or greater" points can be decomposed into exactly $j$, $j+1, \ldots, j+n, \ldots$ points. Software and tables typically will tell you the chance of $j$ or fewer (this is the cumulative distribution function). But additivity implies that the chance of $j$ or greater is one minus the chance of $j-1$ or fewer, so we're ok: we can do all the calculations. (Take note of that "$-1$" there: if we were to get this wrong, we would obtain a reasonable looking but incorrect answer. It can be difficult to check the results of complex probability calculations!)
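To make the "$-1$" warning concrete, here is a short R sketch comparing the tail probability computed from the cumulative distribution function with the same tail summed directly. The threshold $j=2$ and the truncation point of 50 are illustrative choices, not part of the problem.

```r
lambda_minute <- 1/5     # Poisson parameter for a one-minute interval
j <- 2                   # illustrative threshold: "at least two points"

# Tail probability via the complement of the CDF at j - 1.
tail_by_cdf <- 1 - ppois(j - 1, lambda_minute)

# The same tail summed directly over j, j + 1, ... (truncated at 50,
# beyond which the terms are negligible for this parameter).
tail_by_sum <- sum(dpois(j:50, lambda_minute))

# The off-by-one mistake: this is the chance of MORE than j points.
tail_wrong <- 1 - ppois(j, lambda_minute)

print(c(tail_by_cdf, tail_by_sum, tail_wrong))
```

The first two values agree, while the third is noticeably smaller yet still looks like a plausible probability, which is why the mistake is easy to miss.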


As an example and a check of our work, let's simulate the race and compare the simulated results to the calculations. I use R because it's freely available, simple to code, and fast: this simulation will take only a few seconds to run the race over and over again a million times. In each race we record the times of the first four animal sightings. The algorithm exploits the exponential waiting time distribution:

niter <- 10^6 # Number of race iterations
set.seed(17)  # (Creates reproducible results)
x <- apply(matrix(rexp(niter * 4, rate=1/5), nrow=4), 2, cumsum) # Animal sighting times
sum(x[4,] >= 10 & x[4,] < 11) / niter                            # Estimate of Pr(A)

The penultimate line generates four independent exponentially distributed times for each race and adds them up, race by race, thereby simulating the (precise) times of four wild animal sightings. The last line finds the proportion of simulated races in which the fourth sighting occurred between minute 10 and minute 11. The output of this simulation is

0.037738

In other words, in 3.7738% of all simulated races, the fourth wild animal sighting occurred during the eleventh minute. Let's compare it to the exact solution developed previously. That answer is readily expressed using the dpois probability function (to compute $\Pr(S(k,a,b))$) and the ppois cumulative probability function (to compute $\Pr(T(k,a,b))$ as one minus a cumulative probability):

sum(dpois(0:3, 10/5) * (1 - ppois(3:0, 1/5)))

Its output is

0.0377710388

This is extremely close to the simulated results: it corresponds to an expectation that $A$ should occur in 37,771 races out of a million, while in the simulation event $A$ actually occurred 37,738 times. It is fair to attribute the difference of 33 occurrences to chance variation, because each time we simulate a million of these races, the count will vary (typically) by about 191 by chance alone. (This is a simple calculation based on the Binomial distribution.) I may conclude, then, that the agreement between the simulation and the calculation is as good as one may expect, so that either both are correct or both are equally incorrect! I doubt the latter has occurred, because the simulation works in such a different manner from the calculation: it uses exponential variates while the calculation was based solely on the Poisson distribution. If I am incorrect, then, it would only be through a consistent misinterpretation of the question and not through any fundamental error of reasoning or some numerical mistake. That is the value of using simulations to check your answers.
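For anyone who wants to see where the "about 191" comes from, here is a sketch of the Binomial calculation. It treats each simulated race as an independent trial with success probability equal to the exact value of $\Pr(A)$; the variable names are my own.

```r
niter   <- 10^6               # number of simulated races, as in the simulation
p_exact <- sum(dpois(0:3, 10/5) * (1 - ppois(3:0, 1/5)))  # exact Pr(A)

expected_count <- niter * p_exact                  # expected number of races in A
sd_count <- sqrt(niter * p_exact * (1 - p_exact))  # Binomial standard deviation

observed_count <- 37738       # count reported by the simulation
z <- (observed_count - expected_count) / sd_count  # discrepancy in standard deviations

print(c(expected_count, sd_count, z))
```

The discrepancy is a small fraction of one standard deviation, which is what "as good as one may expect" means here.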


Comments

The simulation suggests a quick, easy, do-it-in-your-head solution. Here's how it runs:

The question asks for the chance that the fourth animal sighting occurs between minutes $10$ and $11$. Assume the times between sightings have an exponential distribution with a rate of one per five minutes and are independent. The sum of four independent exponentials is a Gamma$(4)$ distribution with PDF $f(x)=x^3\exp(-x)/3!$. Minutes $10-11$, when expressed in five minute periods, occur in the interval from $2.0$ to $2.2$, so the answer is given by $\int_{2.0}^{2.2}x^3\exp(-x)/3!\ dx$.
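If you would rather let R evaluate that integral than do it by hand, a sketch along these lines should work. The function name `gamma4_pdf` is just a label for the PDF quoted in the text.

```r
gamma4_pdf <- function(x) x^3 * exp(-x) / factorial(3)   # Gamma(4) PDF from the text

# Numerical integration over [2.0, 2.2], in units of five minutes.
by_integration <- integrate(gamma4_pdf, lower = 2.0, upper = 2.2)$value

# The same area from the Gamma CDF, first in five-minute units, then in minutes.
by_cdf_units   <- pgamma(2.2, shape = 4) - pgamma(2.0, shape = 4)
by_cdf_minutes <- pgamma(11, shape = 4, rate = 1/5) - pgamma(10, shape = 4, rate = 1/5)

print(c(by_integration, by_cdf_units, by_cdf_minutes))
```

All three agree with the Poisson-based sum computed earlier, which is a third independent route to the same number.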

Let's approximate this area with a rectangle of base $2.2-2.0 = 0.2$ and height $f((2.0+2.2)/2) = f(2.1)$. The Binomial Theorem implies (among other things) that when we increase a positive quantity by $p\%$ its cube increases by approximately $3p\%$, so $2.1^3$ is about $3\times 5\% = 15\%$ greater than $2^3=8$. Having memorized that $\log(2)\approx 0.7$, we immediately recognize $\exp(-2.1)=\exp(0.7)^{-3}\approx 2^{-3}=1/8$. Finally, $3!=6$. Therefore the integral is approximately $(0.2)(8(1+15/100)\times (1/8) / 6)$, which is $15\%$ (about $1/6$) greater than $1/(5\times 6) = 0.0333\ldots$, or (roughly) $0.038$.

This mental math required only a few simple one-digit multiplications and produced an answer that happens to be correct to within one part in a hundred, which is as accurate as a simulation of $10000$ races would be.
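To see how close the rectangle approximation gets, the following sketch compares the midpoint rectangle (using the exact height $f(2.1)$) and the rounded mental-math value of $0.038$ against the exact area. Variable names are illustrative.

```r
f <- function(x) x^3 * exp(-x) / factorial(3)   # Gamma(4) PDF from the text

rectangle <- 0.2 * f(2.1)      # midpoint rectangle with the exact height
mental    <- 0.038             # the rounded value quoted in the text
exact     <- pgamma(2.2, shape = 4) - pgamma(2.0, shape = 4)

# Relative errors of the two approximations.
rel_err_rectangle <- abs(rectangle - exact) / exact
rel_err_mental    <- abs(mental - exact) / exact

print(c(rectangle, mental, exact))
print(c(rel_err_rectangle, rel_err_mental))
```

Both relative errors come out below one percent, consistent with the claim of one part in a hundred.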

The comparison between the Gamma PDF and our simulation results is excellent. In this figure, the probability asked for in question (a) is the area of the dark red region. We approximated that by the area of a rectangle of almost the same height.

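[Figure: histogram of the simulated times of the fourth sighting, with the Gamma PDF overlaid in red; the region between minutes 10 and 11 is shaded dark red.]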

From this plot we can read off the probabilities of the fourth animal sighting for any time interval merely by estimating areas. For instance, a fast runner would finish this race in (say) 45 minutes. Their chance of not sighting at least four animals is the area to the right of 45, which looks pretty small--just a few percent. (R computes it as 1 - pgamma(45, 4, rate=1/5), giving $2.1\%$.)

The R code to make this plot (without the highlighted region) is

hist(x[4,], breaks=80, probability=TRUE, xlim=c(0,80), ylim=c(0,0.05), col=rgb(.9,.9,.9))
curve(dgamma(x,4,rate=1/5), add=TRUE, lwd=2, col="Red")