[Math] Definition of Conditional Probability by Measure Theory

measure-theory probability

I was reading a book on information theory and entropy by Robert Gray, when I saw the following definition of conditional probability:

Given a probability space $(\Omega,\mathcal{B}, P)$ and a
sub-$\sigma$-field $\mathcal{G}$, for any event $H\in\mathcal{B}$ the
conditional probability $m(H\text{ }|\text{ }\mathcal{G})$ is defined
as any function, say $g$, which satisfies the two properties:

(1) $g$ is measurable with respect to $\mathcal{G}$

(2) $\displaystyle\int_{G}ghdP=m(G\bigcap{}H)$; all $G\in\mathcal{G}$

I am quite confused by this definition, since it is very different from the definition through the joint probability of events.

I understand what a measurable function, a sub-$\sigma$-field and a probability space are, and I'm guessing that the author is trying to define the measure $m$ through the measurable function $g$, but I don't quite understand what the second requirement is saying. In particular, what does the $h$ in $\displaystyle\int_{G}ghdP$ refer to? It just jumped out of nowhere in the book, so I suspect that it may have some conventional meaning?

I'd appreciate it a lot if someone can help.
Thank you!!

Best Answer

The starting point for abstract measure theoretic conditional probability is conditional expectation. Essentially, one uses the identity $P(A)=\mathbb{E}(1_A)$.

Now let $(\Omega,\mathcal{B},P)$ be a probability space, $f$ an integrable random variable and $\mathcal{G}$ a sub-$\sigma$-algebra of $\mathcal{B}$. A conditional expectation of $f$ with respect to $\mathcal{G}$ is a $\mathcal{G}$-measurable function $\mathbb{E}^f_\mathcal{G}$ such that for all $G\in\mathcal{G}$ $$\int_G \mathbb{E}^f_\mathcal{G}~dP=\int_G f~dP.$$ The notion is not very intuitive, but the idea is the following: since $\mathbb{E}^f_\mathcal{G}$ is $\mathcal{G}$-measurable, it uses only the information in $\mathcal{G}$. The integral condition says that $\mathbb{E}^f_\mathcal{G}$ "averages $f$ out" over sets in $\mathcal{G}$.

Now if we want to calculate the conditional probability of the event $H\in\mathcal{B}$ with respect to the sub-$\sigma$-algebra $\mathcal{G}$, we simply take the conditional expectation of the indicator function $1_H$. Then, a conditional probability of $H$ with respect to $\mathcal{G}$ is a $\mathcal{G}$-measurable function $\mathbb{P}^H_\mathcal{G}$ such that for all $G\in\mathcal{G}$ $$\int_G \mathbb{P}^H_\mathcal{G}~dP=\int_G 1_H~dP.$$ Since $\int_G 1_H~dP=P(H\cap G)$, this can be rewritten as $$\int_G \mathbb{P}^H_\mathcal{G}~dP=P(H\cap G).$$

This is fairly standard material, so I assume the author simply made some typos. The $h$ is superfluous and the $m$ should be $P$.
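To make the defining identity concrete, here is a minimal sketch on a finite probability space. The die, the partition generating $\mathcal{G}$ and the event $H$ are all hypothetical choices made for illustration; they do not come from Gray's book or from the answer above. When $\mathcal{G}$ is generated by a finite partition, a conditional probability is constant on each block and equals the elementary ratio $P(H\cap\text{block})/P(\text{block})$ there.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite example: a fair six-sided die.
omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega}

# G is the sigma-field generated by the partition {odd outcomes, even outcomes}.
partition = [frozenset({1, 3, 5}), frozenset({2, 4, 6})]
H = frozenset({1, 2, 3})  # the event whose conditional probability we want


def prob(event):
    """P(event) as an exact fraction."""
    return sum(P[w] for w in event)


# A G-measurable function is constant on each block of the partition.
# On a block it takes the value P(H ∩ block) / P(block).
g = {}
for block in partition:
    value = prob(H & block) / prob(block)
    for w in block:
        g[w] = value

# Every set in G is a union of blocks (the empty union gives the empty set).
sigma_field = [frozenset().union(*combo)
               for k in range(len(partition) + 1)
               for combo in combinations(partition, k)]

# Check the defining identity: the integral of g over G equals P(G ∩ H).
for G in sigma_field:
    integral = sum(g[w] * P[w] for w in G)
    assert integral == prob(G & H)

print(g)
```

In this toy case the abstract definition reduces to the familiar formula applied block by block, which is one way to see that the two definitions are compatible.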

Related Solutions

[Math] Conditional Expectation given X is measurable wrt to sigma field

The conditional expectation $E(X|\mathcal G)$ is a random variable determined (up to sets of probability $0$) by two properties: (i) $E(X|\mathcal G)$ is $\mathcal G$-measurable, and (ii) for each $\mathcal G$-measurable set $B$, the integral of $E(X|\mathcal G)$ over $B$ is equal to the integral of $X$ over $B$. If $X$ is itself $\mathcal G$-measurable, then it has these two properties, and so it must be $E(X|\mathcal G)$.
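As a small illustration of this argument, the sketch below computes $E(X|\mathcal G)$ on a finite space by averaging over the blocks of a partition that generates $\mathcal G$. The space, the partition and the two random variables are hypothetical and chosen only for the example.

```python
from fractions import Fraction

# Hypothetical finite example: a fair die, with G generated by {odd, even}.
omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega}
partition = [frozenset({1, 3, 5}), frozenset({2, 4, 6})]


def cond_exp(X):
    """E(X | G) for G generated by `partition`: the P-average of X on each block."""
    out = {}
    for block in partition:
        mass = sum(P[w] for w in block)
        avg = sum(X[w] * P[w] for w in block) / mass
        for w in block:
            out[w] = avg
    return out


# X_meas is constant on each block, hence G-measurable.
X_meas = {w: (10 if w % 2 else 20) for w in omega}
# X_other distinguishes outcomes inside a block, hence is not G-measurable.
X_other = {w: w for w in omega}

assert cond_exp(X_meas) == X_meas   # a G-measurable X is its own conditional expectation
print(cond_exp(X_other))            # averaged within each block
```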

Conditional Probability on Zero Probability Events – Definition

The short answer

1. Yes, if $N=\mathbb{R}^n$ and $\mathcal{N}$ is the Borel field, then, for every $A \in \mathcal{M}$, the limit $\lim_{\Delta y \to 0} P(X \in A\ |\ Y \in (y-\Delta y, y+\Delta y))$ exists $P_Y$-almost surely, and this is true whether or not $Y = f(X)$, and whether or not $X,Y$ have a joint density. (This is the content of Theorem 9(1) below.)
2. Furthermore, the function $f:\mathbb{R}\rightarrow\mathbb{R}$ obtained by setting $f(y) := \lim_{\Delta y \to 0} P(X \in A\ |\ Y \in (y-\Delta y, y+\Delta y))$ wherever possible, and, say, $f(y):=0$ elsewhere, is consistent with the traditional measure-theoretic definition of $P(X\in A\ |\ Y=y)$, given by $(*)$ below. (This is the content of Corollary 17(1).)

The proof of these facts is a consequence of theorems 1.29 ("Differentiating measures") and 1.30 ("Differentiation of Radon measures") in reference [3] (a shout-out to user Del, who pointed me to this reference). It makes use of the concept of the derivative of an outer measure w.r.t. another outer measure.

I will devote the rest of this answer to carefully deriving facts 1 and 2 stated above. As far as I know, this is the first time this fundamental, intuitive result, which is often claimed (for instance, on p. 157 of [4], p. 136 of [1]), has been proved. I'll be grateful (if somewhat disappointed) to anyone who can cite a precedent.

Example

Before embarking on a formal proof, let's see how the results of the next section can be used to introduce the concept of "probability conditioned on a non-discrete random variable" in a way that is simultaneously intuitive and mathematically sound.

Consider, for instance, the following excerpt taken from a popular undergraduate textbook ([6] example 5e, p. 255).

Consider $n + m$ trials having a common probability of success. Suppose, however, that this success probability is not fixed in advance but is chosen from a uniform $(0,1)$ population.

Letting $N$ denote the number of successes, and letting $X$ denote the probability that a given trial is a success, this excerpt gives natural rise to the concept of conditional probability, but when we attempt to parlay our intuition into formulas, we discover that an expression of the form $P(N = n\ |\ X=x)$ is not well-defined as per the familiar formula $P(A|B) = \frac{P(A\cap B)}{P(B)}$, since $P(X=x)=0$.

Intuition suggests overcoming this obstacle by defining $$ P(N=n\ |\ X=x) = \lim_{\Delta x\downarrow 0} \frac{P(N=n, x - \Delta x < X < x + \Delta x)}{P(x - \Delta x < X < x + \Delta x)}, $$ provided the limit exists. Theorem 9(1) assures us that the limit indeed exists almost everywhere. Corollary 17(1) implies that if we so define $P(N=n\ |\ X=x)$ wherever possible, we will obtain a function that is a conditional probability in the traditional measure-theoretic sense (described in the next paragraph), hence we may soundly subject it to the usual manipulations involving conditional probabilities, such as the law of total probability. Note that in this example the joint random variable $(N,X)$ does not have a joint density (more precisely no joint density w.r.t. the Lebesgue measure on $\mathbb{R}^2$).
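The limit above can be explored numerically. The sketch below assumes the usual reading of the textbook example, namely that given $X=x$ the trials are independent with success probability $x$, so that $P(N=k,\ a<X<b)=\int_a^b \binom{t}{k}x^k(1-x)^{t-k}\,dx$ for $t$ trials. The function names, the midpoint-rule integration and the particular values of `k`, `trials` and `x` are illustrative choices, not part of the original answer.

```python
from math import comb


def binom_pmf(k, trials, p):
    """Binomial probability of k successes in `trials` trials with success probability p."""
    return comb(trials, k) * p**k * (1 - p) ** (trials - k)


def window_ratio(k, trials, x, dx, steps=2000):
    """P(N=k, x-dx < X < x+dx) / P(x-dx < X < x+dx) for X ~ Uniform(0, 1).

    Assumes 0 < x - dx and x + dx < 1, so the window has probability 2*dx.
    """
    a, b = x - dx, x + dx
    h = (b - a) / steps
    # Midpoint rule for the integral of the binomial pmf over (a, b).
    joint = sum(binom_pmf(k, trials, a + (i + 0.5) * h) for i in range(steps)) * h
    return joint / (b - a)


k, trials, x = 3, 10, 0.4
for dx in (0.1, 0.01, 0.001):
    print(dx, window_ratio(k, trials, x, dx))
print("binomial pmf at x:", binom_pmf(k, trials, x))
```

As the window shrinks, the printed ratios approach the binomial probability at $x$, which is the value one would intuitively assign to $P(N=k\ |\ X=x)$.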

Now scratch everything we have discussed so far, and suppose we start by defining the conditional probability $P(X\in A\ |\ Y=y)$ in the traditional measure-theoretic manner (cf. [2] theorem 5.3.1, p. 205) as

any solution, $\varphi$, to the integral system of equations $$ \int_B \varphi\ dP_Y = P(X\in A, Y\in B),\hspace{1cm}B\text{ Borel}, \tag{*}\label{CondProb} $$ where $P_Y$ is the distribution of $Y$, i.e. the probability measure induced on the Borel field via the formula $P_Y(E) = P(Y\in E)$.

(This concept of conditional probability is sometimes called "conditional distribution", as in [5] theorem 6.3, p. 107, and the term "conditional probability" is reserved to a closely-related, but different concept. I will keep to the "conditional probability" terminology.)

Given a two-dimensional random variable $(X,Y)$ with a joint density $f(x,y)$, we may now prove that the familiar definition of "conditional density", namely $$ f_{X|Y=y}(x) = \frac{f(x,y)}{f_Y(y)},\hspace{1cm}\text{wherever the denominator does not vanish} $$ can be used to generate conditional probabilities of the form $P(X\in A\ |\ Y=y)$.

Applying this technique to the following problem, taken from the same textbook ([6] example 5b, p. 252), we find that $P(X > 1\ |\ Y=y) = e^{-1/y}$, $y>0$.

Suppose that the joint density of $X$ and $Y$ is given by $$ f(x,y) = \begin{cases} \frac{e^{-x/y}e^{-y}}{y} & 0 < x < \infty, 0 < y < \infty \\ 0 & \text{otherwise} \end{cases} $$ Find $P(X > 1\ |\ Y=y)$.
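For completeness, here is a sketch of the computation behind the stated answer, using only the conditional-density formula quoted above. The marginal density of $Y$ is $$ f_Y(y) = \int_0^\infty \frac{e^{-x/y}e^{-y}}{y}\,dx = e^{-y}\left[-e^{-x/y}\right]_{x=0}^{x=\infty} = e^{-y},\hspace{1cm}y>0, $$ so the conditional density is $$ f_{X|Y=y}(x) = \frac{f(x,y)}{f_Y(y)} = \frac{e^{-x/y}}{y},\hspace{1cm}x>0, $$ and therefore $$ P(X>1\ |\ Y=y) = \int_1^\infty \frac{e^{-x/y}}{y}\,dx = \left[-e^{-x/y}\right]_{x=1}^{x=\infty} = e^{-1/y}. $$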

Since the solution we obtained is continuous, Corollary 17(2) yields that, for every $y>0$, $$ e^{-1/y} = \lim_{\Delta y\downarrow 0}\frac{P(X>1, y-\Delta y<Y<y+\Delta y)}{P(y-\Delta y<Y<y+\Delta y)}. $$
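This limit can also be checked numerically. The sketch below relies on two facts that follow from the joint density: $f_Y(t)=e^{-t}$ and $P(X>1,\ a<Y<b)=\int_a^b e^{-1/t}e^{-t}\,dt$. The function name, the midpoint-rule integration and the test point $y=2$ are illustrative assumptions.

```python
from math import exp


def window_ratio(y, dy, steps=2000):
    """P(X > 1, y-dy < Y < y+dy) / P(y-dy < Y < y+dy) for the joint density above.

    Assumes y - dy > 0.
    """
    a, b = y - dy, y + dy
    h = (b - a) / steps
    # Midpoint rule for the integral of e^{-1/t} * e^{-t} over (a, b).
    joint = sum(exp(-1 / t) * exp(-t)
                for t in (a + (i + 0.5) * h for i in range(steps))) * h
    # Y has density e^{-t} on (0, infinity), so the window has this probability.
    marginal = exp(-a) - exp(-b)
    return joint / marginal


y = 2.0
for dy in (0.5, 0.05, 0.005):
    print(dy, window_ratio(y, dy))
print("exp(-1/y):", exp(-1 / y))
```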

The formal derivation

Notation 1 Let $n \in \{1, 2, \dots\}$ and let $r \in (0,\infty)$. For every $x \in \mathbb{R}^n$ we denote the open $n$-ball of (Euclidean) radius $r$ about $x$ by $B^{(n)}_r(x)$.

Notation 2 Let $n \in \{1, 2, \dots\}$. We denote the Euclidean topology on $\mathbb{R}^n$ by $\mathcal{E}_n$.

Notation 3 Let $n \in \{1, 2, \dots\}$. We denote the Borel $\sigma$-algebra on $\mathbb{R}^n$ by $\mathcal{B}_n$.

Notation 4 Let $n \in \{1, 2, \dots\}$. For every outer measure $\mu$ on $\mathbb{R}^n$, we denote the collection of $\mu$-measurable sets by $\mathcal{M}_\mu$.

Fix $n \in \{1, 2, \dots\}$ for the remainder of the proof.

Definition 5 An outer measure $\mu$ on $\mathbb{R}^n$ is Radon iff the following three conditions hold.

1. $\mathcal{B}_n \subseteq \mathcal{M}_\mu$.
2. For every $A\subseteq\mathbb{R}^n$ there exists a $B\in\mathcal{B}_n$ such that $A\subseteq B$ and $\mu(A) = \mu(B)$.
3. For every $\mathcal{E}_n$-compact $K\subseteq\mathbb{R}^n$, $\mu(K) < \infty$.

Definition 6 Let $\mu, \nu$ be Radon outer measures on $\mathbb{R}^n$. We denote by $\mathrm{Diff}^\nu_\mu$ the set consisting of all $x \in \mathbb{R}^n$ for which the following pair of conditions hold.

1. For all $r \in (0,\infty)$, $\mu\left(B^{(n)}_r(x)\right) > 0$.
2. There exists some $d \in \mathbb{R}$ that satisfies: $$ d = \lim_{r \downarrow 0} \frac{\nu\left(B^{(n)}_r(x)\right)}{\mu\left(B^{(n)}_r(x)\right)}. $$

Definition 7 Let $\mu, \nu$ be Radon outer measures on $\mathbb{R}^n$. We set $$ D^\nu_\mu(x) := \begin{cases} \lim_{r \downarrow 0} \frac{\nu\left(B^{(n)}_r(x)\right)}{\mu\left(B^{(n)}_r(x)\right)} &, x \in \mathrm{Diff}^\nu_\mu \\ 0 &, \text{otherwise}. \end{cases} $$

Definition 8 Let $\mu, \nu$ be Radon outer measures on $\mathbb{R}^n$. We denote by $\mathrm{\mathbf{Diff}}^\nu_\mu$ the collection consisting of all $Z \subseteq \mathbb{R}^n$ such that the following pair of conditions hold.

1. $Z \subseteq \mathrm{Diff}^\nu_\mu$.
2. $Z = \mathbb{R}^n\setminus A$ for some $A \in \mathcal{M}_\mu$ with $\mu(A) = 0$.

Theorem 9 Let $\mu, \nu$ be Radon outer measures on $\mathbb{R}^n$.

1. $\mathrm{\mathbf{Diff}}^\nu_\mu \neq \emptyset$.
2. For every $Z \in \mathrm{\mathbf{Diff}}^\nu_\mu$, $D^\nu_\mu$ is $\mathcal{Z}/\mathcal{B}_1$-measurable, where $\mathcal{Z}$ is the subset $\sigma$-algebra induced on $Z$ by $\mathcal{M}_\mu$.

Proof See [3], theorem 1.29, p. 48. Q.E.D.

Definition 10

1. Let $\mu, \nu$ be outer measures on $\mathbb{R}^n$. $\nu$ is absolutely continuous w.r.t. $\mu$, written $\nu \ll \mu$, provided $\mu(A) = 0$ implies $\nu(A) = 0$, for every $A \subseteq \mathbb{R}^n$.
2. Let $\mathcal{F}$ be a $\sigma$-algebra on $\mathbb{R}^n$, and let $\mu, \nu$ be measures on $\mathcal{F}$. $\nu$ is absolutely continuous w.r.t. $\mu$, written $\nu \ll \mu$, provided $\mu(A) = 0$ implies $\nu(A) = 0$, for every $A \in \mathcal{F}$.

Lemma 11 Let $\mu, \nu$ be measures on $\mathcal{B}_n$ such that $\nu\ll\mu$, and such that, for every $\mathcal{E}_n$-compact $K$, $\mu(K), \nu(K) < \infty$. Then $\mu, \nu$ can be extended to Radon outer-measures on $\mathbb{R}^n$, $\mu^*, \nu^*$, respectively, such that $\nu^*\ll\mu^*$.

Proof

For every $A \subseteq \mathbb{R}^n$ define $$ \begin{align} \mu^*(A) &:= \inf \left\{\sum_{k = 1}^\infty \mu(B_k)\ :\ \{B_1, B_2, \dots\} \subseteq \mathcal{B}_n,\ A \subseteq \bigcup_{k=1}^\infty B_k\right\}, \\ \nu^*(A) &:= \inf \left\{\sum_{k = 1}^\infty \nu(B_k)\ :\ \{B_1, B_2, \dots\} \subseteq \mathcal{B}_n,\ A \subseteq \bigcup_{k=1}^\infty B_k\right\}. \end{align} $$

According to [7] theorem 2.21 (p. 38), $\mu^*, \nu^*$ are outer-measures on $\mathbb{R}^n$. According to [7] theorem 20.1(b) (p. 502), $\mu^*, \nu^*$ are extensions of $\mu, \nu$, respectively. This implies, in particular, that, for every $\mathcal{E}_n$-compact $K$, $\mu^*(K) = \mu(K) < \infty$ and $\nu^*(K) = \nu(K) < \infty$. According to [7] theorem 20.1(a) (p. 502), $\mathcal{B}_n \subseteq \mathcal{M}_{\mu^*} \cap \mathcal{M}_{\nu^*}$. According to [7] Proposition 20.9 (p. 507), for every $A\subseteq\mathbb{R}^n$ there exists a $B \in \mathcal{B}_n$ such that $A \subseteq B$ and both $\mu^*(A) = \mu^*(B)$ and $\nu^*(A) = \nu^*(B)$. Thus, $\mu^*$ and $\nu^*$ are each Radon.

Let $A \subseteq \mathbb{R}^n$ be such that $\mu^*(A) = 0$. By the preceding paragraph there exists some $B \in \mathcal{B}_n$ such that $A \subseteq B$ and both $\mu^*(A) = \mu^*(B)$ and $\nu^*(A) = \nu^*(B)$. So $$ \mu(B) = \mu^*(B) = \mu^*(A) = 0. $$ So $$ \nu^*(A) = \nu^*(B) = \nu(B) \overset{\nu\ll\mu}{=} 0. $$ Thus $\nu^* \ll \mu^*$.

Q.E.D.

Theorem 12 Let $\mu, \nu$ be Radon outer measures on $\mathbb{R}^n$, and let $Z \in \mathrm{\mathbf{Diff}}^\nu_\mu$. If $\nu \ll \mu$, then, for every $B \in \mathcal{M}_\mu$, $$ \nu(B) = \int_B D^\nu_\mu\mathbb{1}_Z d\mu. $$

Proof See [3], theorem 1.30, p. 50. Q.E.D.
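A one-dimensional special case may help build intuition for Definition 7 and Theorem 12. In the sketch below, $\mu$ is Lebesgue measure on $\mathbb{R}$ and $\nu$ is the measure with density $t^2$ with respect to $\mu$; both are hypothetical choices made purely for illustration. The ratio of ball measures then converges to the density, and integrating that limit recovers $\nu$.

```python
def nu_interval(a, b):
    """nu((a, b)) for the hypothetical measure with density t**2 w.r.t. Lebesgue measure."""
    return (b**3 - a**3) / 3


def mu_interval(a, b):
    """Lebesgue measure of the interval (a, b)."""
    return b - a


def ball_ratio(x, r):
    """nu(B_r(x)) / mu(B_r(x)), the quotient whose limit defines the derivative at x."""
    return nu_interval(x - r, x + r) / mu_interval(x - r, x + r)


x = 0.5
for r in (0.1, 0.01, 0.001):
    print(r, ball_ratio(x, r))  # approaches x**2 as r shrinks

# Theorem 12 in this special case: integrating the derivative t**2 over (a, b)
# with respect to mu recovers nu((a, b)). Midpoint rule for the integral.
a, b, steps = 0.2, 0.9, 10000
h = (b - a) / steps
integral = sum((a + (i + 0.5) * h) ** 2 for i in range(steps)) * h
print(integral, nu_interval(a, b))
```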

Definition 13 Let $(\Omega, \mathcal{F}, P)$ be a probability space, and let $Y:\Omega\rightarrow\mathbb{R}^n$ be $\mathcal{F}/\mathcal{B}_n$-measurable. We denote with $P_Y$ the probability measure induced on $\mathcal{B}_n$ by $Y$ via $(\Omega, \mathcal{F}, P)$.

Notation 14 For every probability measure $\mu$ on $\mathcal{B}_n$, we denote by $\overline{\mathcal{B}_n^\mu}$ the completion of $\mathcal{B}_n$ w.r.t. $\mu$, and we denote by $\overline{\mu}$ the unique extension of $\mu$ to $\overline{\mathcal{B}_n^\mu}$.

Definition 15 Let $(\Omega, \mathcal{F}, P)$ be a probability space, let $A \in \mathcal{F}$, and let $Y:\Omega \rightarrow \mathbb{R}^n$ be $\mathcal{F}/\mathcal{B}_n$-measurable. We denote by $P(A\ |\ Y)$ the set of conditional probabilities of $A$ conditioned on $Y$, as follows. $P(A\ |\ Y)$ shall consist of all functions $f:\mathbb{R}^n\rightarrow\mathbb{R}$ that are $\overline{\mathcal{B}_n^{P_Y}}/\mathcal{B}_1$-measurable, $\overline{P_Y}$-semi-integrable, and such that, for every $B \in \mathcal{B}_n$, $$ \int_B f\ d\overline{P_Y} = P\left(A\cap\{Y \in B\}\right). $$

Definition 16 Let $\mu$ be a measure on $\mathcal{B}_n$. We denote $\mu$'s support by $\mathrm{supp}_\mu$. In other words, $\mathrm{supp}_\mu$ consists of all $x \in \mathbb{R}^n$ such that, for every $\mathcal{E}_n$-open-neighborhood, $G$, of $x$, $\mu(G) > 0$.

Corollary 17 Let $(\Omega, \mathcal{F}, P)$ be a probability space, let $A \in \mathcal{F}$, and let $Y:\Omega \rightarrow \mathbb{R}^n$ be $\mathcal{F}/\mathcal{B}_n$-measurable. Set $\mu := P_Y$, and consider the measure $\nu:\mathcal{B}_n\rightarrow\mathbb{R}$ assigning to every $B \in \mathcal{B}_n$ $\nu(B) := P\left(A\cap\{Y \in B\}\right)$. Then $\mu, \nu$ can be extended to Radon outer-measures on $\mathbb{R}^n$, $\mu^*, \nu^*$, respectively, such that:

1. $D^{\nu^*}_{\mu^*} \in P(A\ |\ Y)$.
2. For every $y \in \mathrm{supp}_\mu$ at which some $f \in P(A\ |\ Y)$ is $\mathcal{E}_n/\mathcal{E}_1$-continuous, $y \in \mathrm{Diff}^{\nu^*}_{\mu^*}$.

Proof

1. Since $\mu, \nu$ are finite measures on $\mathcal{B}_n$ such that $\nu \ll \mu$ (indeed, $\nu(B) \leq \mu(B)$ for every $B \in \mathcal{B}_n$), lemma 11 shows that they may be extended to Radon outer-measures on $\mathbb{R}^n$, $\mu^*, \nu^*$, respectively, such that $\nu^* \ll \mu^*$. Letting $Z \in \mathrm{\mathbf{Diff}}^{\nu^*}_{\mu^*}$, theorem 12 yields that $D^{\nu^*}_{\mu^*}\mathbb{1}_Z \in P(A\ |\ Y)$. Since, by choice of $Z$, $D^{\nu^*}_{\mu^*} = D^{\nu^*}_{\mu^*}\mathbb{1}_Z$ $P_Y$-a.e., the conclusion follows.
2. Let $y \in \mathrm{supp}_\mu$, and let $f \in P(A\ |\ Y)$ be $\mathcal{E}_n/\mathcal{E}_1$-continuous at $y$.

Let $\varepsilon \in (0,\infty)$. Choose $\delta \in (0,\infty)$ such that, for all $z \in B^{(n)}_\delta(y)$, $f(z) \in B^{(1)}_\varepsilon\left(f(y)\right)$. Let $r \in (0,\delta]$. Since $y \in \mathrm{supp}_\mu$, $P_Y\left(B^{(n)}_r(y)\right) > 0$, and we have $$ \begin{align} \frac{\nu^*\left(B^{(n)}_r(y)\right)}{\mu^*\left(B^{(n)}_r(y)\right)} &= \frac{\nu\left(B^{(n)}_r(y)\right)}{\mu\left(B^{(n)}_r(y)\right)} \\ &= \frac{P\left(A\cap\left\{Y\in B^{(n)}_r(y)\right\}\right)}{P_Y\left(B^{(n)}_r(y)\right)} \\ &= \frac{\int_{B^{(n)}_r(y)}\ f\ d\overline{P_Y}}{P_Y\left(B^{(n)}_r(y)\right)} \\ &<\frac{\int_{B^{(n)}_r(y)}\ f(y) + \varepsilon\ d\overline{P_Y}}{P_Y\left(B^{(n)}_r(y)\right)} \\ &= \frac{(f(y)+\varepsilon)\ \int_{B^{(n)}_r(y)}\ d\overline{P_Y}}{P_Y\left(B^{(n)}_r(y)\right)} \\ &= (f(y)+\varepsilon)\frac{\overline{P_Y}\left(B^{(n)}_r(y)\right)}{P_Y\left(B^{(n)}_r(y)\right)} \\ &= (f(y)+\varepsilon)\frac{P_Y\left(B^{(n)}_r(y)\right)}{P_Y\left(B^{(n)}_r(y)\right)} \\ &= f(y)+\varepsilon. \end{align} $$

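Analogously, $$ f(y)-\varepsilon < \frac{\nu^*\left(B^{(n)}_r(y)\right)}{\mu^*\left(B^{(n)}_r(y)\right)}. $$ Since $\varepsilon$ was arbitrary, the ratio converges to $f(y)$ as $r \downarrow 0$. Moreover, $\mu^*\left(B^{(n)}_r(y)\right) > 0$ for every $r \in (0,\infty)$, because $y \in \mathrm{supp}_\mu$. Hence $y \in \mathrm{Diff}^{\nu^*}_{\mu^*}$.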

Q.E.D.

References

[1] Robert B. Ash, Basic Probability Theory, Dover, 2008. (An online version is freely available on the author's website.)

[2] Robert B. Ash, Catherine A. Doléans-Dade, Probability and Measure Theory, 2nd ed., Academic Press, 2000.

[3] Lawrence C. Evans, Ronald F. Gariepy, Measure Theory and Fine Properties of Functions, revised edition, CRC Press, 2015.

[4] William Feller, An Introduction to Probability Theory and Its Applications, Vol. 2, 2nd ed., John Wiley & Sons, 1971.

[5] Olav Kallenberg, Foundations of Modern Probability, 2nd ed., Springer, 2001.

[6] Sheldon M. Ross, A First Course in Probability, 9th ed., Pearson, 2013.

[7] James Yeh, Real Analysis: Theory of Measure and Integration, 3rd ed., World Scientific, 2014.
