I recently came across this identity:

$$E \left[ E \left(Y|X,Z \right) |X \right] =E \left[Y | X \right]$$

I am of course familiar with the simpler version of that rule, namely that $E \left[ E \left(Y|X \right) \right]=E \left(Y\right)$, but I was not able to find a justification for its generalization.

I would be grateful if someone could point me to a not-so-technical reference for that fact or, even better, if someone could lay out a simple proof for this important result.

## Best Answer

**INFORMAL TREATMENT**

We should remember that the notation in which we condition on random variables is inaccurate, although economical. In reality we condition on the sigma-algebra that those random variables generate; in other words, $E[Y\mid X]$ is meant to mean $E[Y\mid \sigma(X)]$. This remark may seem out of place in an "Informal Treatment", but it reminds us that our conditioning entities are collections of sets (and when we condition on a single value, this is a singleton set). And what do these sets contain? They contain the information with which the possible values of the random variable $X$ supply us about what may happen with the realization of $Y$. Bringing in the concept of information permits us to think about (and use) the Law of Iterated Expectations (sometimes called the "Tower Property") in a very intuitive way.
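To make "collections of sets" concrete, consider a toy illustration (mine, not part of the original question): toss a coin twice, so $\Omega = \{HH, HT, TH, TT\}$, and let $X$ indicate whether the first toss is heads and $Z$ whether the second is. Then

$$\sigma(X) = \big\{\varnothing,\; \{HH,HT\},\; \{TH,TT\},\; \Omega\big\}, \qquad \sigma(X,Z) = 2^{\Omega}\;(\text{all } 16 \text{ subsets of } \Omega),$$

so $\sigma(X) \subset \sigma(X,Z)$: observing both tosses answers strictly more questions about the realized outcome than observing the first toss alone.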

The sigma-algebra generated by two random variables is at least as large as the one generated by a single random variable: $\sigma (X) \subseteq \sigma(X,Z)$ in the proper set-theoretic meaning. So the information about $Y$ contained in $\sigma(X,Z)$ is at least as great as the corresponding information in $\sigma (X)$. Now, as notational shorthand, set $\sigma (X) \equiv I_x$ and $\sigma(X,Z) \equiv I_{xz}$. Then the LHS of the equation we are looking at can be written

$$E \left[ E \left(Y|I_{xz} \right) |I_{x} \right]$$

Describing the above expression verbally, we have: "what is the expectation of {the expected value of $Y$ given the information $I_{xz}$}, given that we have available the information $I_x$ only?" Can we somehow "take into account" $I_{xz}$? No; we only know $I_x$. But if we use what we have (as we are obliged to by the expression we want to resolve), then all we can say about $Y$ under the expectation operator is "$E(Y\mid I_x)$", no more: we have just exhausted our information.

Hence $$E \left[ E \left(Y|I_{xz} \right) |I_{x} \right] = E\left(Y|I_{x} \right)$$
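To see this informal argument "in action", here is a minimal simulation sketch (the distributions below are my own illustrative choices, not anything from the question): we estimate the inner conditional expectation $E(Y\mid X,Z)$ by cell means, average those means within each value of $X$, and compare against the direct estimate of $E(Y\mid X)$.

```python
import numpy as np

# Sanity check of E[E(Y|X,Z)|X] = E[Y|X] on simulated discrete data.
rng = np.random.default_rng(0)
n = 200_000

x = rng.integers(0, 3, size=n)             # X takes values 0, 1, 2
z = rng.integers(0, 2, size=n)             # Z takes values 0, 1
y = 2.0 * x + 3.0 * z + rng.normal(size=n)

# Inner layer: estimate E(Y | X, Z) by the mean of y in each (x, z) cell
inner = np.empty(n)
for xv in range(3):
    for zv in range(2):
        cell = (x == xv) & (z == zv)
        inner[cell] = y[cell].mean()

# Outer layer: for each value of X, average the inner estimate and
# compare with the direct estimate of E(Y | X)
for xv in range(3):
    mask = x == xv
    lhs = inner[mask].mean()    # E[ E(Y|X,Z) | X = xv ]
    rhs = y[mask].mean()        # E[ Y | X = xv ]
    print(f"X={xv}:  LHS={lhs:.6f}  RHS={rhs:.6f}")
```

The two columns agree to floating-point precision, not merely up to sampling noise: averaging the $(x,z)$-cell means of $y$ within a value of $x$ reproduces the $x$-group mean of $y$ exactly, which is the Tower Property at work in the empirical distribution itself.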

If somebody else doesn't, I will return with the formal treatment.

**A (BIT MORE) FORMAL TREATMENT**

Let's see how two very important books on probability theory, P. Billingsley's *Probability and Measure* (3rd ed., 1995) and D. Williams's *Probability with Martingales* (1991), treat the matter of proving the Law of Iterated Expectations:

Billingsley devotes exactly three lines to the proof. Williams says no more than that the property is virtually immediate from the definition of conditional expectation; that's one line of text. And Billingsley's proof is no less opaque.

They are of course right: this important and very intuitive property of conditional expectation derives essentially directly (and almost immediately) from its definition. The only problem is, I suspect, that this definition is not usually taught, or at least not highlighted, outside probability or measure-theoretic circles. But in order to show in (almost) three lines that the Law of Iterated Expectations holds, we need the definition of conditional expectation, or rather its defining property.

Let $(\Omega, \mathcal F, \mathbf P)$ be a probability space and $Y$ an integrable random variable. Let $\mathcal G$ be a sub-$\sigma$-algebra of $\mathcal F$, $\mathcal G \subseteq \mathcal F$. Then there exists a function $W$ that is $\mathcal G$-measurable and integrable, and (this is the defining property) satisfies

$$E(W\cdot\mathbb 1_{G}) = E(Y\cdot \mathbb 1_{G})\qquad \forall G \in \mathcal G \qquad [1]$$

where $\mathbb 1_{G}$ is the indicator function of the set $G$. We say that $W$ is ("a version of") the conditional expectation of $Y$ given $\mathcal G$, and we write $W = E(Y\mid \mathcal G) \;a.s.$
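Note, in passing, that the asker's "simpler version" of the rule drops straight out of $[1]$: taking $G = \Omega$ (which always belongs to $\mathcal G$) gives

$$E(W) = E(W\cdot \mathbb 1_{\Omega}) = E(Y\cdot \mathbb 1_{\Omega}) = E(Y), \qquad \text{i.e.}\qquad E\big[E(Y\mid \mathcal G)\big] = E(Y).$$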

The critical detail to note here is that the conditional expectation has the same expected value as $Y$ does, not just over the whole sample space, but on every set $G$ in $\mathcal G$. (I will now try to present how the Tower Property derives from this definition of conditional expectation.)

$W$ is a $\mathcal G$-measurable random variable. Consider then some sub-$\sigma$-algebra, say $\mathcal H \subseteq \mathcal G$. Then $G\in \mathcal H \Rightarrow G\in \mathcal G$. So, in a manner analogous to the above, we have the conditional expectation of $W$ given $\mathcal H$, say $U=E(W\mid \mathcal H) \;a.s.$, which is characterized by

$$E(U\cdot\mathbb 1_{G}) = E(W\cdot \mathbb 1_{G})\qquad \forall G \in \mathcal H \qquad [2]$$

Since $\mathcal H \subseteq \mathcal G$, equations $[1]$ and $[2]$ give us

$$E(U\cdot\mathbb 1_{G}) = E(Y\cdot \mathbb 1_{G})\qquad \forall G \in \mathcal H \qquad [3]$$

But this is the defining property of the conditional expectation of $Y$ given $\mathcal H$, so we are entitled to write $U=E(Y\mid \mathcal H)\; a.s.$ Since by construction we also have $U = E(W\mid \mathcal H) = E\big(E[Y\mid \mathcal G]\mid \mathcal H\big)$, we have just proved the Tower Property, or the general form of the Law of Iterated Expectations, in eight lines.
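Finally, for readers who like to see the defining property with their own eyes, here is a sketch on a finite probability space (the three-toss setup and all names below are my own illustration, not from the books cited). On a finite space, $E(Y\mid \sigma(X))$ is just the probability-weighted average of $Y$ on each block of the partition induced by $X$, so both the Tower Property and property $[3]$ can be checked exactly.

```python
from itertools import product

# Three fair coin tosses: a finite probability space (illustrative setup)
omega = [''.join(t) for t in product('HT', repeat=3)]
prob = {w: 1 / 8 for w in omega}

X = {w: int(w[0] == 'H') for w in omega}                    # first toss
Z = {w: int(w[1] == 'H') for w in omega}                    # second toss
Y = {w: X[w] + 2 * Z[w] + 4 * int(w[2] == 'H') for w in omega}

def partition_by(g):
    """Blocks of Omega on which g is constant; these generate sigma(g)."""
    blocks = {}
    for w in omega:
        blocks.setdefault(g[w], []).append(w)
    return list(blocks.values())

def cond_exp(f, partition):
    """E(f | partition): on each block, the probability-weighted average of f."""
    out = {}
    for block in partition:
        p = sum(prob[w] for w in block)
        avg = sum(prob[w] * f[w] for w in block) / p
        for w in block:
            out[w] = avg
    return out

part_X  = partition_by(X)                                   # sigma(X)
part_XZ = partition_by({w: (X[w], Z[w]) for w in omega})    # sigma(X, Z)

W = cond_exp(Y, part_XZ)    # E(Y | X, Z): averages out the third toss
U = cond_exp(W, part_X)     # E( E(Y | X, Z) | X )
V = cond_exp(Y, part_X)     # E(Y | X), computed directly

assert all(abs(U[w] - V[w]) < 1e-12 for w in omega)   # the Tower Property

# Property [3]: E(U * 1_G) = E(Y * 1_G) for the generating sets G of sigma(X)
for G in part_X:
    lhs = sum(prob[w] * U[w] for w in G)
    rhs = sum(prob[w] * Y[w] for w in G)
    assert abs(lhs - rhs) < 1e-12
```

The final loop is exactly equation $[3]$ restricted to the generating sets of $\sigma(X)$; on a finite space that is enough, since every set in $\sigma(X)$ is a union of such blocks and the expectations add up over disjoint sets.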