It would be extremely helpful if someone could give me the formal definitions of conditional probability and conditional expectation in the following setting. We are given a probability space
$ (\Omega, \mathscr{A}, \mu ) $ with $\mu(\Omega) = 1 $, and a random variable $ X : \Omega \rightarrow \mathbb{R}^n $, where for any Borel set $ A \in \mathscr{B}(\mathbb{R}^n) $ we define
$$ \mathbb{P}(X \in A) = (X_*\mu)(A) = \mu(X^{-1}(A)) = \mu(\{\omega\in \Omega \mid X(\omega) \in A\}) \quad\text{and}\quad \mathbb{E}(X) = \int_\Omega X\,d\mu. $$
Regardless of whether $X$ and $Y$ are discrete or continuous (with densities $f_X, f_Y $ and joint density $f_{X,Y} $ with respect to some measure $\nu$ on $\mathbb{R}^n $), I am asking for the definitions
of $ \mathbb{P}(Y\in B \mid X \in A) $ and $ \mathbb{E}(Y \mid X) $ for all Borel sets $ A, B \in \mathscr{B}(\mathbb{R}^n) $, keeping in mind that $ \mathbb{P}(X \in A) $ may well be zero.
In our probability class, something of the following sort was mentioned: if
$\delta_x$ is the Dirac distribution at $ x $, then we have
$$ \mathbb{E}(Y|X = x) = \frac{\mathbb{E}(\delta_x(X)Y)}{\mathbb{P}(X=x)}$$
which I can't make any sense of. Any appropriate reference for this material is also very welcome.
Thank you.
Best Answer
Throughout this post, let $(\Omega,\mathcal{F},P)$ be a probability space. Let us first define the conditional expectation ${\rm E}[X\mid\mathcal{G}]$ for integrable random variables $X:\Omega\to\mathbb{R}$, i.e. $X\in L^1(P)$, and sub-sigma-algebras $\mathcal{G}\subseteq\mathcal{F}$.
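For reference, the standard definition can be sketched as follows (the labels (i)-(iii) are assumed to match those used in the note that follows). A random variable $Z$ is called a conditional expectation of $X$ given $\mathcal{G}$, written $Z={\rm E}[X\mid\mathcal{G}]$, if
$$ \text{(i)}\ Z \text{ is } \mathcal{G}\text{-measurable},\qquad \text{(ii)}\ {\rm E}|Z|<\infty,\qquad \text{(iii)}\ \int_A Z\,\mathrm dP=\int_A X\,\mathrm dP\ \text{ for all } A\in\mathcal{G}. $$
Existence of such a $Z$ follows from the Radon-Nikodym theorem applied to the signed measure $A\mapsto\int_A X\,\mathrm dP$ on $\mathcal{G}$.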
Note: It makes sense to talk about the conditional expectation since, if $U$ is another random variable satisfying (i)-(iii), then $U=Z$ $P$-a.s.
I'm not aware of any definition of $P(Y\in B\mid X\in A)$ other than the obvious one, i.e. $$ P(Y\in B\mid X\in A)=\frac{P(Y\in B,X\in A)}{P(X\in A)} $$ provided that $P(X\in A)>0$. The only exception is when $A$ consists of a single point, i.e. $A=\{x\}$ for some $x\in\mathbb{R}$. In this case, the object $P(Y\in B\mid X=x)$ is defined in terms of a regular conditional distribution.
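As a small illustration of the ratio formula, here is a sketch on a finite probability space. The model is purely hypothetical (two fair dice, with $X$ the first die and $Y$ the sum of both), and the names `omega`, `mu`, `prob` and `cond_prob` are made up for this example; exact arithmetic via `fractions.Fraction` avoids rounding issues.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy model: omega = pairs of fair dice, mu = uniform measure.
omega = list(product(range(1, 7), repeat=2))
mu = {w: Fraction(1, 36) for w in omega}


def X(w):
    """First die."""
    return w[0]


def Y(w):
    """Sum of both dice, so Y depends on X."""
    return w[0] + w[1]


def prob(event):
    """mu of the set of outcomes w for which event(w) is true."""
    return sum((mu[w] for w in omega if event(w)), Fraction(0))


def cond_prob(B, A):
    """P(Y in B | X in A) via the ratio formula; only defined if P(X in A) > 0."""
    p_A = prob(lambda w: X(w) in A)
    if p_A == 0:
        raise ZeroDivisionError("P(X in A) = 0: the ratio formula does not apply")
    return prob(lambda w: X(w) in A and Y(w) in B) / p_A


# P(Y >= 10 | X in {5, 6})
print(cond_prob(B={10, 11, 12}, A={5, 6}))
```

The explicit failure when $P(X\in A)=0$ mirrors the restriction in the text: the ratio simply carries no information on null events, which is why a different construction is needed for $A=\{x\}$ in the continuous case.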
Let us first define regular conditional probabilities. Let $X:\Omega\to\mathbb{R}$ be a random variable.
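A standard formulation can be sketched as follows (the labels (i)-(iii) are assumed to match those referred to in the note that follows). A regular conditional probability of $P$ given $X$ is a mapping $P^X(\cdot\mid\cdot):\mathcal{F}\times\mathbb{R}\to[0,1]$ such that
$$ \begin{aligned} &\text{(i)}\ \ x\mapsto P^X(A\mid x) \text{ is Borel measurable for every } A\in\mathcal{F},\\ &\text{(ii)}\ \ A\mapsto P^X(A\mid x) \text{ is a probability measure on } (\Omega,\mathcal{F}) \text{ for every } x\in\mathbb{R},\\ &\text{(iii)}\ \ \int_B P^X(A\mid x)\,P_X(\mathrm dx)=P(A\cap\{X\in B\}) \text{ for all } A\in\mathcal{F},\ B\in\mathcal{B}(\mathbb{R}), \end{aligned} $$
where $P_X=P\circ X^{-1}$ denotes the distribution of $X$.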
Note: A mapping satisfying (i) and (ii) is often called a Markov kernel. Furthermore, since $(\mathbb{R},\mathcal{B}(\mathbb{R}))$ is a nice space, the regular conditional probability is unique in the sense that if $\tilde{P}^X(\cdot\mid\cdot)$ is another regular conditional probability of $P$ given $X$, then we have that $P^X(\cdot\mid x)=\tilde{P}^X(\cdot\mid x)$ for $P_X$-a.a. $x$. Here $P_X=P\circ X^{-1}$ is the distribution of $X$.
Now let us introduce another random variable $Y:\Omega\to\mathbb{R}$, and let $P^X(\cdot\mid \cdot)$ still denote a regular conditional probability of $P$ given $X$.
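With this notation, the usual way to define a regular conditional distribution of $Y$ given $X$ (stated here as a sketch of the standard definition) is to evaluate the regular conditional probability on events of the form $\{Y\in B\}$:
$$ P_{Y\mid X}(B\mid x):=P^X(\{Y\in B\}\mid x),\qquad B\in\mathcal{B}(\mathbb{R}),\ x\in\mathbb{R}. $$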
Instead of $P_{Y\mid X}(B\mid x)$ one often writes $P(Y\in B\mid X=x)$.
An easy consequence of this definition is that $(B,x)\mapsto P_{Y\mid X}(B\mid x)$ is a Markov kernel and for any $A,B\in\mathcal{B}(\mathbb{R})$ we have $$ \int_A P_{Y\mid X}(B\mid x)\,P_X(\mathrm dx)=P(\{X\in A\}\cap\{Y\in B\}). \tag{1} $$
In fact, $P_{Y\mid X}(\cdot \mid \cdot)$ is a regular conditional distribution of $Y$ given $X$ if and only if $P_{Y\mid X}(\cdot\mid\cdot)$ is a Markov kernel and satisfies $(1)$. Again $(1)$ is often referred to as the defining equation.
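In the purely discrete case, equation $(1)$ can be checked by hand. The sketch below does so for a hypothetical toy model (two fair dice, $X$ the first die, $Y$ the sum); the names `kernel`, `lhs` and `rhs` are invented for the illustration. On atoms of $P_X$ the kernel is the elementary ratio; off the support of $P_X$ it is set to the Dirac measure at $0$, an arbitrary choice that does not matter because those points form a $P_X$-null set.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy model: omega = pairs of fair dice, mu = uniform measure.
omega = list(product(range(1, 7), repeat=2))
mu = {w: Fraction(1, 36) for w in omega}


def X(w):
    return w[0]


def Y(w):
    return w[0] + w[1]


def prob(event):
    """mu of the set of outcomes w for which event(w) is true."""
    return sum((mu[w] for w in omega if event(w)), Fraction(0))


def p_X(x):
    """Mass that the distribution P_X puts on the point x."""
    return prob(lambda w: X(w) == x)


def kernel(B, x):
    """P_{Y|X}(B | x): elementary ratio on atoms of P_X, Dirac at 0 elsewhere."""
    px = p_X(x)
    if px == 0:
        # Arbitrary probability measure; these x form a P_X-null set.
        return Fraction(int(0 in B))
    return prob(lambda w: X(w) == x and Y(w) in B) / px


def lhs(A, B):
    """Integral of the kernel over A with respect to P_X (a finite sum here)."""
    return sum((kernel(B, x) * p_X(x) for x in A), Fraction(0))


def rhs(A, B):
    """P({X in A} and {Y in B})."""
    return prob(lambda w: X(w) in A and Y(w) in B)


A, B = {2, 4, 7}, {5, 6, 7}
print(lhs(A, B), rhs(A, B), lhs(A, B) == rhs(A, B))
```

Note that `A` deliberately contains the point `7`, which $X$ never attains: the value of the kernel there is irrelevant to $(1)$, which is exactly the "for $P_X$-a.a. $x$" uniqueness mentioned earlier.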
Let us denote $\psi(x)={\rm E}[U\mid X=x]$. Then we have the following:
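Assuming, as is standard, that ${\rm E}[U\mid X=x]$ is defined for $U\in L^1(P)$ by integrating $U$ against the regular conditional probability, the key facts can be sketched as
$$ \psi(x)=\int_\Omega U(\omega)\,P^X(\mathrm d\omega\mid x),\qquad \int_B \psi(x)\,P_X(\mathrm dx)=\int_{\{X\in B\}}U\,\mathrm dP\ \text{ for all } B\in\mathcal{B}(\mathbb{R}), $$
and consequently $\psi(X)={\rm E}[U\mid\sigma(X)]$ $P$-a.s. In other words, ${\rm E}[U\mid X]$ is obtained by plugging $X$ into the function $x\mapsto{\rm E}[U\mid X=x]$.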
The following is an extremely useful rule when calculating with conditional distributions:
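In the form needed for the example that follows, the rule (often called the substitution rule; the statement here is a sketch of the usual formulation) says that for any measurable $f:\mathbb{R}^2\to\mathbb{R}$,
$$ P^X(\{f(X,Y)\in B\}\mid x)=P^X(\{f(x,Y)\in B\}\mid x)\qquad\text{for all } B\in\mathcal{B}(\mathbb{R}) \text{ and } P_X\text{-a.a. } x, $$
i.e. $f(X,Y)\mid X=x\sim f(x,Y)\mid X=x$: when conditioning on $X=x$, one may replace $X$ by the value $x$.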
The following example shows how this rule can be useful: Let $X$ and $Y$ be independent $\mathcal{N}(0,1)$ random variables, and let $U=X+Y$. Then we claim that $U\mid X=x\sim \mathcal{N}(x,1)$ for $P_X$-a.a. $x$. To see this, note that by the rule above, the distributions of $U\mid X=x$ and $Y+x\mid X=x$ are the same. But since $Y$ is independent of $X$, $Y+x\mid X=x$ has the same distribution as $Y+x$. We can write this as follows: $$ U\mid X=x\sim Y+x\mid X=x\sim Y+x\sim\mathcal{N}(x,1). $$
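The claim can be made plausible numerically. Since $\{X=x\}$ has probability zero, the sketch below conditions instead on $X$ falling in a small window around a point `x0`; this is only a heuristic approximation of the regular conditional distribution, not its definition. The sample size, seed, window width `h` and the point `x0` are arbitrary choices for the illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

n = 200_000
x0, h = 1.0, 0.05  # approximate {X = x0} by the window {|X - x0| < h}

samples = []
for _ in range(n):
    x = random.gauss(0.0, 1.0)  # X ~ N(0, 1)
    y = random.gauss(0.0, 1.0)  # Y ~ N(0, 1), independent of X
    if abs(x - x0) < h:
        samples.append(x + y)   # keep U = X + Y on the conditioning window

cond_mean = statistics.mean(samples)
cond_var = statistics.variance(samples)

# If U | X = x0 ~ N(x0, 1), the mean should be near x0 and the variance near 1.
print(f"kept {len(samples)} samples")
print(f"empirical mean of U given X near {x0}: {cond_mean:.3f}")
print(f"empirical variance of U given X near {x0}: {cond_var:.3f}")
```

Shrinking `h` reduces the bias from using a window instead of a point, at the cost of keeping fewer samples; repeating the run with other values of `x0` shows the mean tracking `x0` while the variance stays close to $1$.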