For (1), the forward direction is true and can be proved as you suggest, provided $X,Y$ are integrable. The converse is false even for ordinary independence (where $\Sigma$ is the trivial $\sigma$-field $\{\Omega, \emptyset\}$, or $Z = c$ is constant): $E[XY] = E[X] E[Y]$ says only that $X,Y$ are uncorrelated, which is weaker than independence. (Wikipedia gives the example $X \sim U(-1,1)$, $Y = X^2$.)
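A quick Monte Carlo sanity check of this counterexample (a Python sketch, not part of the argument; the thresholds $1/2$ and $1/4$ are just a convenient choice of events witnessing the dependence):

```python
# Monte Carlo check of the counterexample X ~ U(-1,1), Y = X^2:
# uncorrelated (E[XY] = E[X]E[Y]) but clearly not independent.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1_000_000)
y = x**2

# Uncorrelated: both of these are ~ 0.
print(np.mean(x * y), np.mean(x) * np.mean(y))

# Not independent: {|X| <= 1/2} and {Y <= 1/4} are the *same* event,
# so the joint probability is ~ 0.5 while the product is ~ 0.25.
print(np.mean((np.abs(x) <= 0.5) & (y <= 0.25)),
      np.mean(np.abs(x) <= 0.5) * np.mean(y <= 0.25))
```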
What is true is that $X,Y$ are conditionally independent given $\Sigma$ (of which $\Sigma = \sigma(Z)$ is a special case) iff $$E[f(X) g(Y) | \Sigma] = E[f(X) | \Sigma] E[g(Y)|\Sigma]$$ for all $f,g$ in some suitably large class of measurable functions from $\mathbb{R}$ to $\mathbb{R}$. For instance, one could take (a numerical illustration follows the list):
- all bounded measurable functions
- indicator functions of measurable sets
- bounded continuous functions
- smooth and compactly supported functions
- indicators of open/closed/half-open intervals or half-lines.
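As a purely numerical illustration of this criterion in the special case $\Sigma = \sigma(Z)$ with $Z$ discrete, here is a Python sketch; the particular pair $X = Z + \varepsilon_1$, $Y = 2Z + \varepsilon_2$ (conditionally independent given $Z$ but not unconditionally independent) and the half-line indicator test functions are just one convenient choice:

```python
# Numerical illustration of the test-function criterion for Sigma = sigma(Z)
# with Z discrete: conditional expectations given Z are just averages over {Z = z}.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
z = rng.integers(0, 2, size=n)          # Z takes values in {0, 1}
x = z + rng.normal(size=n)              # X and Y are conditionally independent
y = 2 * z + rng.normal(size=n)          # given Z, but not unconditionally independent

f = lambda t: (t > 0.5).astype(float)   # indicator of a half-line, as a test function
g = lambda t: (t > 1.0).astype(float)   # another half-line indicator

for val in (0, 1):
    m = (z == val)
    lhs = np.mean(f(x[m]) * g(y[m]))           # E[f(X) g(Y) | Z = val]
    rhs = np.mean(f(x[m])) * np.mean(g(y[m]))  # E[f(X) | Z = val] * E[g(Y) | Z = val]
    print(val, lhs, rhs)                       # the last two columns agree (up to noise)
```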
For (2), this is false. You can get counterexamples to both directions by taking $X,Y$ to be not (unconditionally) independent and considering the extreme cases where $\Sigma$ is trivial or $\Sigma = \mathcal{F}$.
I will try to answer my own question above; it would be great if someone could confirm (since, again, I did not find this described in any textbook, apart from the one exercise in Billingsley mentioned below)!
To set things up, let $(\Omega, \mathcal{F}, \mathbb P)$ be a probability space. $A \in \mathcal{F}$ is an event with probability $\mathbb P(A) > 0$, $X: \Omega \to \mathbb R$ is a random variable and $\mathcal{G} \subset \mathcal{F}$ is a sub-$\sigma$-algebra.
We are interested in defining: $\mathbb E[X \mid A, \mathcal{G}]$. There are two "natural" ways to do this.
First, we just use the standard definition of conditional expectation, but with respect to the measure $\mathbb P_A$, the conditional probability measure that assigns to every $\mathcal{F}$-measurable set $B$ the mass
$$ \mathbb P_A(B) = \frac{\mathbb P(A \cap B)} {\mathbb P(A)}$$
Thus we define $\mathbb E[X \mid A, \mathcal{G}]$ for $X \in L^1(\mathbb P_A)$ by the following properties:
- $\mathbb E[X \mid A, \mathcal{G}]$ is $\mathcal{G}$-measurable.
- $\int_B \mathbb E[X \mid A, \mathcal{G}] d\mathbb P_A = \int_B X d\mathbb P_A $ for all $\mathcal{G}$-measurable sets $B$.
We can quickly see that $X \in L^1(\mathbb P)$ is sufficient for $X \in L^1(\mathbb P_A)$ (since $\mathbb P_A \le \mathbb P(A)^{-1}\,\mathbb P$), while for the second property, since $\int_G \cdot \; d\mathbb P_A = \mathbb P(A)^{-1} \int_{G \cap A} \cdot \; d\mathbb P$ for every $G \in \mathcal{G}$, it suffices to check:
$$\int_B \mathbb E[X \mid A, \mathcal{G}] \, d\mathbb P = \int_B X \, d\mathbb P \qquad \text{for all } B \in \{ G \cap A \mid G \in \mathcal{G}\}.$$
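Here is a small finite toy example (a Python sketch; the space $\Omega = \{0,\dots,5\}$, the partition generating $\mathcal G$, the event $A$, and the values of $X$ are all made up for illustration) that builds $\mathbb E[X \mid A, \mathcal G]$ cell by cell under $\mathbb P_A$ and then verifies the reduced condition above:

```python
# Finite toy example of definition 1: E[X | A, G] as a conditional expectation
# under P_A, with G generated by a partition of Omega = {0,...,5}.
import numpy as np

p = np.array([0.1, 0.2, 0.1, 0.2, 0.3, 0.1])   # P on Omega = {0,...,5}
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # a random variable
A = np.array([1, 1, 0, 1, 1, 0], dtype=bool)   # the event A, with P(A) > 0
cells = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]  # partition generating G

pA = p * A / p[A].sum()                        # the conditional measure P_A

# E[X | A, G] is G-measurable, i.e. constant on each cell; on a cell C with
# P_A(C) > 0 it equals  int_C X dP_A / P_A(C).
cond = np.zeros_like(X)
for C in cells:
    if pA[C].sum() > 0:
        cond[C] = (X[C] * pA[C]).sum() / pA[C].sum()

# Check the reduced defining property: int_B cond dP = int_B X dP
# for every B of the form (cell of G) intersected with A.
for C in cells:
    B = C[A[C]]                                # B = C ∩ A, as indices
    print((cond[B] * p[B]).sum(), (X[B] * p[B]).sum())   # equal on each line
```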
The second way of defining $\mathbb E[X \mid A, \mathcal{G}]$ is to define it for indicator variables of $\mathcal{F}$-measurable sets $B$ as (see also the related math.se post):
$$ \mathbb E[ \mathbf{1}_{B} \mid A , \mathcal{G}] = \frac{\mathbb E[ \mathbf{1}_{B}\mathbf{1}_{A} \mid \mathcal{G}] }{\mathbb E[ \mathbf{1}_{A} \mid \mathcal{G}]}$$
By Exercise 34.4 a) in Billingsley's book "Probability and Measure", these two definitions are in fact equivalent, so we are good to go.
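For what it's worth, here is a small self-contained numerical check (the same kind of finite toy setup as above, with an arbitrarily chosen set $B$) that the two definitions agree for an indicator $X = \mathbf 1_B$:

```python
# Self-contained finite check that the two definitions of E[. | A, G] agree
# for an indicator X = 1_B, with G generated by a partition of Omega = {0,...,5}.
import numpy as np

p = np.array([0.1, 0.2, 0.1, 0.2, 0.3, 0.1])   # P on Omega = {0,...,5}
A = np.array([1, 1, 0, 1, 1, 0], dtype=bool)   # event A with P(A) > 0
B = np.array([1, 0, 1, 1, 0, 0], dtype=bool)   # an F-measurable set B
cells = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]  # partition generating G

pA = p * A / p[A].sum()                        # the conditional measure P_A

for C in cells:
    # Definition 2: E[1_B 1_A | G] / E[1_A | G]; given the partition, these
    # conditional expectations are just P(. ∩ C) / P(C) on the cell C.
    def2 = (p[C] * B[C] * A[C]).sum() / (p[C] * A[C]).sum()
    # Definition 1: E_{P_A}[1_B | G] on the cell C, i.e. P_A(B ∩ C) / P_A(C).
    def1 = (pA[C] * B[C]).sum() / pA[C].sum()
    print(def2, def1)                          # equal on each cell
```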
Now we are still interested in the calculus of such conditional expectations. It turns out to be simple: we can just use the standard calculus, with all expectations taken w.r.t. the measure $\mathbb P_A$. Likewise, properties such as "on the event $A$, $X$ is independent of $\mathcal{G}$" simply mean that $X$ is independent of $\mathcal{G}$ under the measure $\mathbb P_A$.
Best Answer
Notice that the random variable $\mathbb{E}(X|A)$ is $\sigma(A)$-measurable, so, since $\sigma(A)$ and $\sigma(B)$ are independent, $$\mathbb{E}\left({\mathbb{E}}\left(X|A\right)|B\right)=\mathbb{E}\left({\mathbb{E}}\left(X|A\right)\right)=\mathbb{E}(X),$$ and hence it is $\mathbb{P}$-a.e. constant.
On the other hand, for example, if $X$ is $\sigma(A,B)$-measurable and it is not $\mathbb{P}$-a.e. constant, then: $$\mathbb{E}\left(X|\sigma(A,B)\right)=X\neq\mathbb{E}(X).$$
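A simulation sketch of this point (Python; the choice $X = A + B$ with $A, B$ independent Bernoulli$(1/2)$ is just one convenient example where $X$ is $\sigma(A,B)$-measurable and non-constant):

```python
# Simulation: with A and B independent, E[E[X|A] | B] collapses to the
# constant E[X], while E[X | sigma(A,B)] = X is not constant.
import numpy as np

rng = np.random.default_rng(2)
n = 2_000_000
a = rng.integers(0, 2, size=n)     # A ~ Bernoulli(1/2)
b = rng.integers(0, 2, size=n)     # B ~ Bernoulli(1/2), independent of A
x = a + b                          # X is sigma(A,B)-measurable and non-constant

# E[X | A]: since A is discrete, average X over each value of A.
e_x_given_a = np.array([x[a == v].mean() for v in (0, 1)])[a]

# E[ E[X|A] | B ]: average the inner conditional expectation over each value of B.
outer = np.array([e_x_given_a[b == v].mean() for v in (0, 1)])

print(outer, x.mean())   # both entries of `outer` are ~ E[X] = 1:
                         # the iterated conditional expectation is (a.e.) constant
```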
So, where did your intuition fail? When you compute $\mathbb{E}(X|A)$, you get the best prediction of $X$ given knowledge of $A$. When you then compute $\mathbb{E}(\mathbb{E}(X|A)|B)$, you get the best prediction, given $B$, of [the best prediction of $X$ given $A$], not the best prediction of $X$ given both $A$ and $B$. The first prediction has to be constant: it is a prediction based on information independent of the quantity you are estimating, so that information is useless, and the best you can do is the same as when you know nothing at all, namely the expectation of the quantity. The second, on the other hand, can very well be non-constant.