[Math] Following a proof that conditional expectation is best mean square predictor

probability, probability theory

As a disclaimer, I'm very much a probability amateur. Please forgive me if I have some gaps in my knowledge; I'm still taking my first course on measure-theoretic probability and I have much left to learn.

I'm reading about conditional expectation in some of my professor's lecture notes, and he has a proof I'm having trouble following that shows that, for some $\sigma$-algebras $\mathcal{D}\subset\mathcal{F}$ and random variable $Y$ on $(\Omega, \mathcal{F})$, of all random variables on $(\Omega, \mathcal{D})$, $E(Y|\mathcal{D})$ is the best predictor of $Y$ in mean square error; i.e. $E(Y-X)^2$ is minimized over all $X$ on $(\Omega, \mathcal{D})$ when $X=E(Y|\mathcal{D})$.

He shows that an a.s.-unique minimizing r.v. exists, and I can follow that. However, when he's showing that $E(Y|\mathcal{D})$ minimizes, I get lost. Here's the proof:

$E(Y-X)^2 = E(Y-E(Y|\mathcal{D})+E(Y|\mathcal{D}) - X)^2$
$= E[(Y-E(Y|\mathcal{D}))^2]+E[(E(Y|\mathcal{D})-X)^2] + 2E[(Y-E(Y|\mathcal{D}))(E(Y|\mathcal{D})-X)]$.

We'll focus on that last term on the RHS (here's where I'm about to get lost). He claims, first, that $2E[(Y-E(Y|\mathcal{D}))(E(Y|\mathcal{D})-X)] =2E[E[(Y-E(Y|\mathcal{D}))(E(Y|\mathcal{D})-X)|\mathcal{D}]]$; to make that a bit easier to read, if we let $\phi = (Y-E(Y|\mathcal{D}))(E(Y|\mathcal{D})-X)$, then he's saying $2E\phi =2E[E(\phi|\mathcal{D})].$ I'm not sure exactly why this is true; I'm only aware of the general equality $E(E(Z|\mathcal{D})) = EZ$ necessarily being true when $E|Z|<\infty$, or when $\sigma(Z)$ is independent of $\mathcal{D}$, or when $Z$ is $\mathcal{D}$-measurable (since then $E(Z|\mathcal{D})=Z$). Unfortunately, I don't really see why any of these hold in this case. Maybe I'm missing something obvious? Or perhaps there's another condition for this statement to hold that I'm not aware of?

Moving past this point, he claims that
$2E[E[(Y-E(Y|\mathcal{D}))(E(Y|\mathcal{D})-X)|\mathcal{D}]]= 2E[(E(Y|\mathcal{D})-X)E[(Y-E(Y|\mathcal{D}))|\mathcal{D}]]$; again, to introduce notation for readability, if we let $\psi = (Y-E(Y|\mathcal{D}))$ and $\kappa = (E(Y|\mathcal{D})-X)$, he's saying that $2E[E(\psi\kappa|\mathcal{D})] =2E[\kappa E(\psi|\mathcal{D})]$. This is where confusion really sets in; I simply have no idea why this is true. I know that $E(\kappa|\mathcal{D})=\kappa$, since $\kappa$ is $\mathcal{D}$-measurable, but I'm not sure why $E(\psi\kappa|\mathcal{D})=E(\psi|\mathcal{D})E(\kappa|\mathcal{D})$. This is something that kind of resembles an independence condition; I don't see any reason $\psi$ and $\kappa$ have to be independent, though, so maybe it's something else? Some illumination here would be greatly appreciated.

The rest of the proof I can follow just fine, but I'll write it out just to make sure I'm not making any errors there: from this point, since (in the notation of the above paragraph) $E(\psi|\mathcal{D})=0$ (which is easy to show), the whole expectation is 0, so that we have $E(Y-X)^2 = E[(Y-E(Y|\mathcal{D}))^2]+E[(E(Y|\mathcal{D})-X)^2]\geq E[(Y-E(Y|\mathcal{D}))^2]$, with equality when $X=E(Y|\mathcal{D})$; so by the previous theorem on the a.s.-uniqueness of the minimizing r.v., the minimizer is $X=E(Y|\mathcal{D})$ a.s.
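
For completeness, here's the "easy to show" step as I understand it; it's just linearity of conditional expectation together with the fact that $E(Y|\mathcal{D})$ is $\mathcal{D}$-measurable:

$$E(\psi|\mathcal{D}) = E[Y-E(Y|\mathcal{D})|\mathcal{D}] = E(Y|\mathcal{D}) - E[E(Y|\mathcal{D})|\mathcal{D}] = E(Y|\mathcal{D}) - E(Y|\mathcal{D}) = 0 \quad \text{a.s.}$$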

Thanks for any light you can shed on those early steps in the proof! I'm really enjoying learning probability so far and I appreciate any help I can get on my way.

Best Answer

First of all, thanks for the well thought-out post. It was easy to see what your difficulties were.

  1. Note that $\mathbb{E}[(Y-X)^2]$ is always well-defined as the expectation of a nonnegative random variable. However, in order for you to write $$\mathbb{E}[(Y-X)^2]=\mathbb{E}[(Y-E(Y|\mathcal{D}))^2]+\mathbb{E}[(E(Y|\mathcal{D})-X)^2] + 2\mathbb{E}[(Y-E(Y|\mathcal{D}))(E(Y|\mathcal{D})-X)],$$ you have to assume that $(Y-E(Y|\mathcal{D}))(E(Y|\mathcal{D})-X)$ is integrable, or else you could possibly get $\infty-\infty$ on the right-hand side. In practice, you probably have to assume that $X,Y$ are square-integrable in order for all the terms to be finite. By the Cauchy-Schwarz inequality, it is then clear that the random variable that you denoted by $\phi$ is integrable, i.e. such that $\mathbb{E}[|\phi|]<\infty$ (this bound is sketched just after this list).

  2. In fact, if $\kappa$ is $\mathcal{D}$-measurable and $\psi$ is an integrable random variable such that $\kappa\psi$ is also integrable, you have $\mathbb{E}[\kappa\psi\,|\,\mathcal{D}]=\kappa\mathbb{E}[\psi\,|\,\mathcal{D}]$ (often called "pulling out what is known"). To see this, just go back to the definition of conditional expectation. That is, let $G\in\mathcal{D}$. Then, $$\mathbb{E}[\psi\underbrace{\kappa1_G}_{\mathcal{D}\text{-measurable}}]=\mathbb{E}[\mathbb{E}[\psi\,|\,\mathcal{D}]\kappa1_G]$$ (the second sketch below indicates why this holds for general $\mathcal{D}$-measurable $\kappa$, not just indicators); since $\kappa\mathbb{E}[\psi\,|\,\mathcal{D}]$ is $\mathcal{D}$-measurable, this is exactly the defining property of $\mathbb{E}[\kappa\psi\,|\,\mathcal{D}]$, thus $\mathbb{E}[\kappa\psi\,|\,\mathcal{D}]=\kappa\mathbb{E}[\psi\,|\,\mathcal{D}]$ a.s.
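
To spell out the Cauchy-Schwarz bound from point 1 (a sketch, assuming $X,Y\in L^2$; then $E(Y|\mathcal{D})\in L^2$ as well, e.g. by the conditional Jensen inequality, so both factors below are square-integrable):

$$\mathbb{E}[|\phi|]=\mathbb{E}\big[|Y-E(Y|\mathcal{D})|\cdot|E(Y|\mathcal{D})-X|\big]\leq\sqrt{\mathbb{E}[(Y-E(Y|\mathcal{D}))^2]}\,\sqrt{\mathbb{E}[(E(Y|\mathcal{D})-X)^2]}<\infty.$$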
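
As for the equality used in point 2: it is literally the definition of $\mathbb{E}[\psi\,|\,\mathcal{D}]$ when $\kappa$ is the indicator of a set $A\in\mathcal{D}$ (apply the definition to the set $A\cap G\in\mathcal{D}$); by linearity it then holds for simple $\mathcal{D}$-measurable $\kappa$, and a standard approximation argument (monotone/dominated convergence, splitting $\psi$ and $\kappa$ into positive and negative parts) extends it to general $\mathcal{D}$-measurable $\kappa$ with $\kappa\psi$ integrable. This is only a sketch, but the upshot is that

$$\mathbb{E}[\psi\kappa1_G]=\mathbb{E}\big[\mathbb{E}[\psi\,|\,\mathcal{D}]\,\kappa1_G\big]=\mathbb{E}\big[\big(\kappa\,\mathbb{E}[\psi\,|\,\mathcal{D}]\big)1_G\big]\quad\text{for every }G\in\mathcal{D},$$

which, combined with the $\mathcal{D}$-measurability of $\kappa\,\mathbb{E}[\psi\,|\,\mathcal{D}]$, identifies it as (a version of) $\mathbb{E}[\kappa\psi\,|\,\mathcal{D}]$.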

The rest of your proof seems fine to me.
