Consider two Bernoulli variables (with values in $\{0,1\}$) $X,Y$. We have
$$\operatorname{Cov}(X,Y) = E(XY) - E(X)E(Y) = P(X=1,Y=1) - P(X=1)P(Y=1)$$
Then $\operatorname{Cov}(X,Y) = 0$ implies $P(X=1,Y=1) = P(X=1)P(Y=1)$. The remaining cells of the joint law then factor as well: for example, $P(X=1,Y=0) = P(X=1) - P(X=1,Y=1) = P(X=1)\big(1-P(Y=1)\big) = P(X=1)P(Y=0)$, and similarly for the other two cells. Hence $P(X=x,Y=y) = P(X=x)P(Y=y)$ for all $x,y\in\{0,1\}$. That is, for Bernoulli variables non-correlation implies independence.
Now, your $X,Y,Z,W$ are Bernoulli variables (up to a scaling and shifting that alters neither covariance nor independence) that are pairwise uncorrelated, hence they are pairwise independent.
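The argument can be checked numerically. Below is a small numpy sketch (not part of the original answer; the marginals $p=0.3$, $q=0.6$ are arbitrary assumed values): zero covariance pins down the cell $P(X=1,Y=1)$, and then every cell of the joint pmf factors.

```python
import numpy as np

# For {0,1}-valued X, Y the joint pmf has only four cells, so
# Cov(X,Y) = P(X=1,Y=1) - P(X=1)P(Y=1) = 0 forces every cell to factor.
p, q = 0.3, 0.6                      # assumed marginals P(X=1), P(Y=1)
p11 = p * q                          # Cov = 0 pins down this cell
joint = {(1, 1): p11,
         (1, 0): p - p11,            # P(X=1) = P(1,1) + P(1,0)
         (0, 1): q - p11,
         (0, 0): 1 - p - q + p11}
for (x, y), pxy in joint.items():
    px = p if x == 1 else 1 - p
    py = q if y == 1 else 1 - q
    assert np.isclose(pxy, px * py)  # every cell factors: independence
```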
This question actually helped me clarify the meaning of linear regression and, in particular, distinguish it from the conditional expectation. So I'm going to go ahead and answer it myself, but hopefully someone else can chime in with a different answer (possibly with a different application). Also, since I'm answering this for myself, it's a bit more detailed/reflective than it probably needs to be, and may not be 100% error-free.
If $X\in L^{2}(\sigma(X))$, then we can consider the linear span $\mathcal{L}(S_{X})$ of the set $S_{X}=\{\mathbb{1}_{\Omega},X\}$ (note that $\mathbb{1}_{\Omega}\equiv1$ is $\sigma(X)$-measurable and that $S_{X}$ is linearly independent if $X$ is not a constant):
$$\mathcal{L}(S_{X})=\{Z\in L^{2}(\sigma(X))\;|\;Z=a\mathbb{1}_{\Omega}+bX\text{ for some }a,b\in\mathbb{R}\}.$$
Since $|S_{X}|=2<\infty$, $\mathcal{L}(S_{X})$ is a closed linear subspace of $L^{2}(\sigma(X))\subset L^{2}(\Omega)$, and so we can apply Hilbert Space theory (actually, in this case finite-dimensional linear algebra) to find the unique $\hat{Y}\in\mathcal{L}(S_{X})$ such that
$$||Y-\hat{Y}||_{2}=\inf_{Z\in\mathcal{L}(S_{X})}||Y-Z||_{2},$$
which we know is
$$\hat{Y}(\omega)=(\operatorname{proj}_{\mathcal{L}(S_{X})}Y)(\omega).$$
To compute this in the usual way, we first apply the Gram-Schmidt procedure to $S_{X}$, since $(\mathbb{1}_{\Omega},X)_{2}=EX\neq0$ in general (i.e., $S_{X}$ is not an orthogonal set). Since $X-\operatorname{proj}_{1_{\Omega}}X=X-EX=:X-\mu_{X}$, we simply transform $S_{X}$ into the set
$$S_{X}\mapsto\{1_{\Omega}, X-\mu_{X}\}.$$
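As a quick numerical sanity check of this Gram-Schmidt step (a numpy sketch with an arbitrary simulated $X$; not part of the original argument): after centering, $X-\mu_X$ is orthogonal to $1_{\Omega}$ in the sample $L^2$ inner product.

```python
import numpy as np

# On a discrete sample space with uniform weights, the L^2 inner product
# is (U, V)_2 = mean(U * V).  Subtracting the mean removes X's component
# along 1_Omega, making the transformed pair orthogonal.
rng = np.random.default_rng(0)
X = rng.normal(2.0, 1.5, size=10_000)   # arbitrary non-centered sample
ones = np.ones_like(X)
Xc = X - X.mean()                        # X - proj onto 1_Omega
inner = np.mean(ones * Xc)               # (1_Omega, X - mu_X)_2
assert abs(inner) < 1e-10                # orthogonal after centering
```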
Now $S_{X}$ is orthogonal and we have (using the usual notation $\operatorname{Var}(X)=\mathbb{E}[(X-\mu_{X})^{2}]=\sigma^{2}_{X}$ and $\operatorname{Cov}(X,Y)=\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]=\sigma_{XY}$):
$$\begin{align*}
\hat{Y}(\omega)
&=(\operatorname{proj}_{1_{\Omega}}Y)(\omega)+(\operatorname{proj}_{X-\mu_{X}}Y)(\omega)
\\&=\frac{(1_{\Omega},Y)_{2}}{||1_{\Omega}||_{2}^{2}}1_{\Omega}(\omega)+\frac{(X-\mu_{X},Y)_{2}}{||X-\mu_{X}||_{2}^{2}}(X(\omega)-\mu_{X})
\\&=\mu_{Y}-\frac{\mathbb{E}[Y(X-\mu_{X})]}{\sigma^{2}_{X}}\mu_{X}+\frac{\mathbb{E}[Y(X-\mu_{X})]}{\sigma^{2}_{X}}X(\omega)
\\&=\mu_{Y}-\frac{\sigma_{XY}}{\sigma^{2}_{X}}\mu_{X}+\frac{\sigma_{XY}}{\sigma^{2}_{X}}X(\omega)
\\&:=\mu_{Y}-b\mu_{X}+bX(\omega)
\\&:=a+bX(\omega)
\\&\in\mathcal{L}(S_{X})
\end{align*}
$$
This is of course the usual linear regression formula. Note that at the end we used the initially surprising identity
$$\operatorname{Cov}(X,Y)=\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]=\mathbb{E}[X(Y-\mu_{Y})]=\mathbb{E}[(X-\mu_{X})Y].$$
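For a concrete check of the derivation, here is a numpy sketch (the toy linear model and sample size are my own assumptions): the projection coefficients $b=\sigma_{XY}/\sigma_{X}^{2}$ and $a=\mu_{Y}-b\mu_{X}$ coincide with an ordinary least-squares fit.

```python
import numpy as np

# Verify that the projection formula matches a least-squares fit on a sample.
rng = np.random.default_rng(1)
X = rng.normal(size=5000)
Y = 2.0 + 3.0 * X + rng.normal(size=5000)    # assumed toy model
b = np.cov(X, Y, bias=True)[0, 1] / np.var(X)  # sigma_XY / sigma_X^2
a = Y.mean() - b * X.mean()                    # mu_Y - b mu_X
b_ls, a_ls = np.polyfit(X, Y, 1)               # least-squares slope/intercept
assert np.allclose([a, b], [a_ls, b_ls])
```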
Note that the addition of $1_{\Omega}$ to $S_{X}$ is analogous to the addition of the constant $e^{i0}=1$ in Fourier analysis (and in dealing with any other orthonormal function basis). In particular, it frees us of the requirement that $\hat{Y}$ "pass through the origin" so that $\hat{Y}$ can be an affine linear transformation of $X$. Note that we get formula (2) in my question when we don't include $1_{\Omega}$ and just project (regress) onto $X$:
$$\hat{Y}(\omega)=\frac{(X,Y)}{||X||^{2}_{2}}X(\omega)$$
or if we want to project onto $X-\mu_{X}$ instead:
$$\hat{Y}(\omega)=\frac{\sigma_{XY}}{\sigma^{2}_{X}}(X(\omega)-\mu_{X})=-b\mu_{X}+bX(\omega).$$
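A small numpy sketch (with assumed toy parameters, not from the question) illustrating that projecting onto $X$ alone generally gives a different coefficient than the affine fit whenever $\mu_{X}\neq0$:

```python
import numpy as np

# Regression through the origin uses (X, Y)_2 / ||X||_2^2, which differs
# from the affine slope Cov/Var unless the means cooperate.
rng = np.random.default_rng(2)
X = rng.normal(1.0, 1.0, size=5000)          # nonzero mean on purpose
Y = 4.0 + 0.5 * X + rng.normal(size=5000)    # assumed toy model
b_origin = np.mean(X * Y) / np.mean(X**2)    # projection onto X alone
b_affine = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
assert abs(b_origin - b_affine) > 0.5        # clearly different here
```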
Now, in general we don't have $\mathcal{L}(S_{X})=L^{2}(\sigma(X))$, and so we don't generally expect that $\hat{Y}=\mathbb{E}[Y|X]$ for any given $X$. In particular, we expect to have
$$||Y-\hat{Y}||_{2}\geq||Y-\mathbb{E}[Y|X]||_{2}$$
because the selection of $\hat{Y}$ is from a smaller subspace than that of $\mathbb{E}[Y|X]$, even though the same estimator $X$ is being used.
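The inequality can be made vivid with a standard example (a numpy sketch; the choice $Y=X^{2}$ with symmetric $X$ is my own illustration, not from the question): the conditional expectation recovers $Y$ exactly, while the best affine predictor cannot.

```python
import numpy as np

# With Y = X^2 and X symmetric about 0, E[Y|X] = X^2 exactly, while
# Cov(X, X^2) = 0 means the best affine predictor is roughly the constant
# mu_Y, so the conditional expectation has strictly smaller L^2 error.
rng = np.random.default_rng(3)
X = rng.normal(size=100_000)
Y = X**2
b = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
a = Y.mean() - b * X.mean()
err_lin = np.mean((Y - (a + b * X))**2)   # ||Y - Yhat||^2
err_cond = np.mean((Y - X**2)**2)         # ||Y - E[Y|X]||^2 = 0 here
assert err_cond < err_lin
```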
It is a well-known fact that if $Z$ and $X$ generate the same $\sigma$-algebra ($\sigma(Z)=\sigma(X)$), then there exists an invertible Borel function $f$ such that $Z=f(X)$ and $X=f^{-1}(Z)$. Furthermore, for any such $Z$ there exists a Borel function $g$ such that $\mathbb{E}[Y|X]=g(Z)$ (these facts help clarify that conditional expectation is projection onto a subspace defined by a $\sigma$-algebra, not by any particular random variable). Therefore, we can always express $\mathbb{E}[Y|X]$ in terms of any such $Z$ by finding the functions $f$ and $g$ (which exist essentially as a consequence of how measure-theoretic probability models information through $\sigma$-algebras):
$$\mathbb{E}[Y|\sigma(X)](\omega)=\mathbb{E}[Y|X](\omega)=g(Z(\omega))=g(f(X(\omega))):=h(X(\omega)).$$
A natural question is then under what circumstances do we have $h(x)=a+bx$ so that
$$\mathbb{E}[Y|X](\omega)=a+bX(\omega)?$$
It is clear that $h(x)=a+bx$ for some choices of predictor, and that $a$ and $b$ must then coincide with the coefficients above because of the optimization property defining $\mathbb{E}[Y|X]$ (for instance, take $X$ equal to $\mathbb{E}[Y|X]$ itself, or any $Z$ that equals $\mathbb{E}[Y|X]$ up to an affine transformation).
However, in the real world it is unlikely we would be so lucky as to happen upon such a nice predictor variable $X$ (or have sample data available if we knew what it was). Nevertheless, it is quite common in practice to see people claim that $\mathbb{E}[Y|X]=\hat{Y}$. The justification derives from a number of assumptions in a variety of different contexts, and it is almost always an approximation in the realm of statistical inference (for instance, one might start by imposing some structure or model on $Y$, e.g., $Y=a+bX+\epsilon$, and then obtain $\hat{Y}=\mathbb{E}[Y|X]$ as a consequence of further assumptions). But at least from a probability theory perspective (where we have the full population data, i.e., the random variables themselves and their distributions), we can say something definitive.
If $(X,Y)$ are jointly normally distributed, then $\mathbb{E}[Y|X]=\hat{Y}$.
The proof follows from considering the minimum-variance "hedge" $W:=Y-\frac{\sigma_{XY}}{\sigma_{X}^{2}}X$, observing that $W$ and $X$ are independent (this follows from $\operatorname{Cov}(X,W)=0$, since $(X,W)$ is jointly normal as well), and then calculating $\mathbb{E}[Y|X]$ with the usual rules (e.g., pulling out independent factors).
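A Monte Carlo sketch of this hedge argument (the mean vector and covariance matrix below are arbitrary assumptions of mine): $W=Y-bX$ is uncorrelated with $X$, and the binned conditional mean of $Y$ matches $a+bX$.

```python
import numpy as np

# For jointly normal (X, Y), W = Y - (Cov/Var) X is uncorrelated with X
# (exactly in-sample by construction of b; Cov(X, W) = 0 in population),
# and binned conditional means of Y line up with the line a + bX.
rng = np.random.default_rng(4)
cov = [[1.0, 0.8], [0.8, 2.0]]                 # assumed covariance matrix
X, Y = rng.multivariate_normal([0.0, 1.0], cov, size=200_000).T
b = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
a = Y.mean() - b * X.mean()
W = Y - b * X                                   # the "hedge"
assert abs(np.cov(X, W, bias=True)[0, 1]) < 1e-10
mask = np.abs(X - 1.0) < 0.05                   # condition on X near 1
assert abs(Y[mask].mean() - (a + b * 1.0)) < 0.08  # E[Y|X=1] ~ a + b
```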
Thus, if $Y$ is normally distributed, we can find $\mathbb{E}[Y|\mathcal{G}]$ by finding an $X\in L^{2}(\mathcal{G})$ such that $(X,Y)$ is jointly normal. But in general we still face the same problem: there need not be any $X\in L^{2}(\mathcal{G})$ such that $\hat{Y}=\mathbb{E}[Y|\mathcal{G}]$. Nevertheless, we can summarize the above as follows.
Given $X$ (or a $\sigma$-algebra $\mathcal{G}$), we want to find $\mathbb{E}[Y|X]\in L^{2}(\sigma(X))$ satisfying the $L^{2}$-norm optimization property. If we are given $\sigma(X)$-measurable variables $X_{1},X_{2},\ldots,X_{n}$, then we can estimate $\mathbb{E}[Y|X]$ (and by extension $Y$ itself) by projecting onto $\mathcal{L}(S_{X_{1},\ldots,X_{n}})$ (after performing an orthogonalization procedure and eliminating any linearly dependent random variables). This has been illustrated above for the cases $n=0$ and $n=1$. In particular,
$$\hat{Y}_{0}(\omega)\equiv\mu_{Y}$$
and
$$\hat{Y}_{1}(\omega)=\mu_{Y}+\frac{\sigma_{XY}}{\sigma^{2}_{X}}(X(\omega)-\mu_{X})$$
and so on for $\hat{Y}_{2},\ldots,\hat{Y}_{n}$. If $\dim(L^{2}(\sigma(X)))=n<\infty$, then we will have $\hat{Y}_{n}=\mathbb{E}[Y|X]$. However, given the difficulty of visualizing these $\sigma$-algebras and the resulting $L^{2}$ spaces, it is hard to say in general when the dimension is finite and whether an orthonormal basis can easily be found in either the finite- or infinite-dimensional case. One probably has to fall back on the measure-theoretic definition of $\mathbb{E}[Y|X]$ as a Radon-Nikodym derivative and the guess-and-check strategy (at least in probability theory; in statistics one introduces additional regressors to span a bigger subspace of $L^{2}(\sigma(X))$, searches for a better predictor function $g=g(X_{1},\ldots,X_{n})$ of the regressors already in play, or makes extra modelling assumptions).
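A final numpy sketch (toy linear model assumed) showing the nested estimates improving: each $\hat{Y}_{k}$ projects onto a larger subspace, so the $L^2$ error is non-increasing.

```python
import numpy as np

# Yhat_0 = mu_Y projects onto span{1}; Yhat_1 = mu_Y + b (X - mu_X)
# projects onto span{1, X}.  The larger subspace can only reduce the error.
rng = np.random.default_rng(5)
X = rng.normal(size=50_000)
Y = 1.0 + 2.0 * X + rng.normal(size=50_000)    # assumed toy model
b = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
err0 = np.mean((Y - Y.mean())**2)                       # n = 0 estimate
err1 = np.mean((Y - (Y.mean() + b * (X - X.mean())))**2)  # n = 1 estimate
assert err1 <= err0
```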
Best Answer
Your argument is not valid. Conditional expectation is the projection onto the space of all random variables measurable w.r.t. $\sigma\{Z_s : s\leq t\}$, not the projection onto the vector space spanned by $\{Z_s : s\leq t\}$. The former space contains many nonlinear functions of $\{Z_s : s\leq t\}$, such as $Z_t^{2}$.