How was the recursive relation of this expected value derived?

Tags: expected-value, probability-theory, random-variables

I'm trying to calculate the expected value of a random variable defined as follows:

For $n \in \mathbb{N}$ we pick randomly with equal probability a number $Y_1$ out of the set $\{0,\dots,n\}$. We then pick another number $Y_2$ out of $\{0,\dots,Y_1\}$ and continue recursively until we get $0$.
We define $X$ as the number of picks that are necessary to pick the $0$.

What's confusing me here is that we defined random variables in our probability course as functions from a countable set $\Omega$ to $\mathbb{R}$ however this random variable can take the value $\infty$ doesn't it? Why isn't that a problem in this case?

The solution calculates the value recursively where $X(n)$ is the number of necessary steps when picking from the set $\{0,\dots,n\}$.

  1. They start with the base case: $X(1)$ has a geometric distribution with parameter $\frac{1}{2}$, thus $\mathbb{E}[X(1)]=2$
  2. In the induction step they use the following reasoning: assuming $n\geq 2$, **if in the first pick we get a $k\neq0$, then the remaining number of steps has the same distribution as $X(k)$**, thus:
    $$\mathbb{E}[X(n)]=1+ \sum_{k=1}^n \mathbb{E}[X(k)] \cdot \mathbb{P}[Y_1=k]$$

Then one can solve the recursion and calculate the expected value.
What's bothering me though is the second step. Although it makes sense intuitively, I don't see why it is mathematically correct, especially the statement in bold and how the equation follows from it.

Best Answer

Let's look at a very simple example. Suppose $n = 2$ and we want to evaluate $\operatorname{E}[X(2)]$. We have three possible choices for $Y_1$:

$Y_1 = 0$: Then we stop, and $X = 1$.

$Y_1 = 1$: Then $Y_2 \in \{0, 1\}$, and there is some expected number of additional steps, $\operatorname{E}[X(1)]$, that it will take to reach $0$, but we have used up one step already, so the overall expected number of steps will be $1 + \operatorname{E}[X(1)]$.

$Y_1 = 2$: Then $Y_2 \in \{0, 1, 2\}$ and we have changed nothing, essentially having taken a step but the expected number of additional steps needed to get to $0$ remains unchanged because we haven't eliminated any numbers. In this case, the overall expected number of steps will be $1 + \operatorname{E}[X(2)]$.

So, to compute the total expectation, we must take each of these cases and weight them by their probability of occurring:

$$\operatorname{E}[X(2)] = (1 \cdot \Pr[Y_1 = 0]) + (1 + \operatorname{E}[X(1)])\Pr[Y_1 = 1] + (1 + \operatorname{E}[X(2)])\Pr[Y_1 = 2].$$ But since this is a sum taken over all elementary outcomes of $Y_1$ (that is to say, $\Pr[Y_1 = 0] + \Pr[Y_1 = 1] + \Pr[Y_1 = 2] = 1$), we can "pull out" the $1$ like so: $$\operatorname{E}[X(2)] = 1 + \operatorname{E}[X(1)]\Pr[Y_1 = 1] + \operatorname{E}[X(2)]\Pr[Y_1 = 2].$$
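This little system can also be solved numerically. Here is a minimal sketch using exact rational arithmetic (the variable names `e1`, `e2` are mine): the $n=1$ equation is solved first, then substituted into the $n=2$ equation.

```python
from fractions import Fraction

# E[X(1)] = 1 + E[X(1)] * P[Y1 = 1], with P[Y1 = 1] = 1/2.
# Moving the E[X(1)] term to the left-hand side and solving:
e1 = 1 / (1 - Fraction(1, 2))

# E[X(2)] = 1 + E[X(1)] * P[Y1 = 1] + E[X(2)] * P[Y1 = 2],
# with P[Y1 = k] = 1/3 for each k in {0, 1, 2}.
e2 = (1 + e1 * Fraction(1, 3)) / (1 - Fraction(1, 3))

print(e1, e2)  # 2 5/2
```

So on average it takes $2$ picks starting from $\{0,1\}$ and $5/2$ picks starting from $\{0,1,2\}$.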

Now you can see how this reasoning extends to larger $n$: $$\operatorname{E}[X(n)] = 1 + \sum_{k=1}^n \operatorname{E}[X(k)] \Pr[Y_1 = k],$$ which is the formula stated in your question.
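The recursion is easy to evaluate for any $n$: move the $\operatorname{E}[X(n)]$ term to the left-hand side and solve, exactly as in the $n = 2$ case. A sketch (the function name `expected_steps` is mine); unwinding the recursion suggests the closed form $\operatorname{E}[X(n)] = 1 + H_n$, where $H_n$ is the $n$-th harmonic number, which the final assertion checks:

```python
from fractions import Fraction

def expected_steps(n):
    """Compute E[X(m)] for m = 1..n from the recursion
    E[X(m)] = 1 + sum_{k=1}^{m} E[X(k)] * P[Y1 = k],  P[Y1 = k] = 1/(m+1),
    after moving the E[X(m)] term to the left-hand side."""
    e = {}
    for m in range(1, n + 1):
        p = Fraction(1, m + 1)                    # P[Y1 = k] for each k
        partial = sum(e[k] for k in range(1, m))  # sum over k = 1..m-1
        e[m] = (1 + p * partial) / (1 - p)
    return e

e = expected_steps(10)
print(e[1], e[2])  # 2 5/2

# Conjectured closed form: E[X(m)] = 1 + H_m (harmonic number)
assert all(e[m] == 1 + sum(Fraction(1, k) for k in range(1, m + 1))
           for m in e)
```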


From the nature of the follow-up questions in your comments, it is clear that you need to revisit some of the probability concepts on which this solution is based. First is the law of total probability, and its relative the law of total expectation.

Let's consider the following example. Suppose I have a coin and two six-sided dice. These are not your typical coin and dice though: the coin is twice as likely to show heads as tails, and although the dice are not loaded, the numbers on their faces are not $1$ through $6$, but instead:

  • Die A: $\{1, 1, 2, 3, 5, 5\}$
  • Die B: $\{2, 4, 4, 5, 6, 6\}$

Now, say we flip the coin and the outcome of the flip determines which die we roll. If the coin is heads, we roll Die $A$. If tails, we roll $B$. Each of the six faces occurs with equal probability but some of the numbers occur more than once. Your score is the rolled value after the coin flip and die roll. If I asked you what is the expected value of the score, how would you compute it? Clearly, on average, if you roll Die $A$, your expected score is $$\frac{1+1+2+3+5+5}{6} = \frac{17}{6} \approx 2.8333.$$ If you roll Die $B$, your expected score is $$\frac{2+4+4+5+6+6}{6} = \frac{27}{6} = 4.5.$$ But because you are flipping an unfair coin to decide which die to roll, on average, you are rolling Die $A$ twice as often as Die $B$. So to get the overall expected value of your score, the expected values we computed for each die must be weighted by the respective probabilities of rolling those dice; i.e., the average score should be $$\frac{17}{6} \cdot \frac{2}{3} + \frac{27}{6} \cdot \frac{1}{3} = \frac{61}{18} \approx 3.3889.$$ This should be entirely intuitive and if you do not understand this, you must review probability theory from the beginning.
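The whole calculation above fits in a few lines of exact arithmetic (a sketch; the names `die_a`, `die_b`, `e_score` are mine):

```python
from fractions import Fraction

die_a = [1, 1, 2, 3, 5, 5]
die_b = [2, 4, 4, 5, 6, 6]

def die_mean(faces):
    """Expected value of a fair roll: each face equally likely."""
    return Fraction(sum(faces), len(faces))

p_heads = Fraction(2, 3)  # unfair coin: heads twice as likely as tails

# Weight each die's expected score by the probability of rolling that die.
e_score = die_mean(die_a) * p_heads + die_mean(die_b) * (1 - p_heads)
print(e_score)  # 61/18
```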

But what did we actually do here? We applied the law of total expectation. The coin flip outcome is a Bernoulli random variable, say $$X \sim \operatorname{Bernoulli}(p = 2/3)$$ in which $X$ counts the number of heads observed. So $X = 1$ means we flipped heads and $X = 0$ means we flipped tails; and $\Pr[X = 1] = p = 2/3$ as desired. We can define $A$ and $B$ as discrete random variables with mass functions $$\Pr[A = 1] = 1/3, \quad \Pr[A = 2] = 1/6, \quad \Pr[A = 3] = 1/6, \quad \Pr[A = 5] = 1/3, \\ \Pr[B = 2] = 1/6, \quad \Pr[B = 4] = 1/3, \quad \Pr[B = 5] = 1/6, \quad \Pr[B = 6] = 1/3.$$ Then we let $S$ be the random score, so in particular $$S = \begin{cases} A, & X = 1 \\ B, & X = 0 \end{cases}$$ with $$\operatorname{E}[S \mid X = 1] = \operatorname{E}[A], \\ \operatorname{E}[S \mid X = 0] = \operatorname{E}[B].$$ Then by the law of total expectation, $$\operatorname{E}[S] = \operatorname{E}[S \mid X = 1]\Pr[X = 1] + \operatorname{E}[S \mid X = 0]\Pr[X = 0].$$ This is exactly what we calculated above, formalized into probability notation. Proving the law of total expectation is beyond the scope of this discussion; I have merely provided motivation for why it should be true. Consult a textbook in probability theory for a proof.

As for your other question about "infinity," I do not understand why you seem concerned about this. Consider a random variable $$Y \sim \operatorname{Geometric}(p), \\ \Pr[Y = y] = (1-p)^{y-1} p, \quad y \in \{1, 2, 3, \ldots\}.$$ This parametrization of a geometric random variable counts the total number of independent and identically distributed Bernoulli trials, each with success probability $p$, up to and including the first success. So, for example, with our earlier unfair coin (ignore the dice), if "success" is defined as flipping heads, and we repeatedly flip the coin until we observe the first head and take $Y$ as the number of times we flipped the coin, then $Y$ is geometric with $p = 2/3$, and $$\Pr[Y = 1] = (1 - 2/3)^0 (2/3) = 2/3.$$ This should be obvious: with probability $2/3$, we get heads on the first flip. What is the probability that it takes exactly two flips to get heads? We would need to first get tails, then heads, so this is $$\Pr[Y = 2] = (1 - 2/3)^1 (2/3) = 2/9.$$ And what is the probability that it takes exactly five flips to get the first heads? It is $$\Pr[Y = 5] = (1 - 2/3)^4 (2/3) = 2/243.$$ On average, how many flips are needed? This is $$\operatorname{E}[Y] = \sum_{y=1}^\infty y \Pr[Y = y] = \frac{1}{p} = \frac{3}{2} = 1.5.$$ Are you at all concerned that this sum is an infinite series; that there is some nonzero probability that it might take $10^{10^{10}}$ or more flips to get the first instance of heads? Of course not: the probability of such an outcome is vanishingly small. So why are you suddenly concerned about "infinity" in the original question? In order for the process to continue indefinitely, each time a $Y_k$ is picked, it must not be $0$. Even in the worst case, where only $\{0,1\}$ remains to choose from, the number of additional steps before terminating is a geometric random variable with $p = 1/2$.
The process is assured to stop in finite time with probability $1$, just as you are assured to eventually flip heads if you keep flipping a coin long enough.
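These geometric probabilities can be checked directly. A sketch (the function name `pmf` is mine); the partial sums of the series for $\operatorname{E}[Y]$ converge to $1/p = 3/2$ because the tail $\Pr[Y > N] = (1-p)^N$ vanishes geometrically:

```python
from fractions import Fraction

p = Fraction(2, 3)  # probability of heads for the unfair coin

def pmf(y, p):
    """P[Y = y] for Y ~ Geometric(p), counting trials up to the first success."""
    return (1 - p) ** (y - 1) * p

print(pmf(1, p), pmf(2, p), pmf(5, p))  # 2/3 2/9 2/243

# Partial sum of E[Y] = sum_{y>=1} y * P[Y = y]; approaches 1/p = 3/2.
approx = sum(y * pmf(y, p) for y in range(1, 200))
print(float(approx))  # ~1.5
```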
