Let me expand here on the point made in the comments. First, I slightly misread your question. I had in mind the two-parameter ergodic means given by
$$
A_{n,m}(f) = \frac{1}{n m} \sum_{i = 0}^{n - 1} \sum_{j = 0}^{m - 1} f(T^i x, T^j y).
$$
Not the diagonal means $A_{n}(f) = A_{n,n}(f)$. For the first you need $f$ to be in $L \log L(\Omega)$. For the second I do not have a definitive answer; it may be enough to be in $L^1$, see the edit below.
First, the following result is known.
Theorem: Let $(\Omega, \mu) = (\Omega_1 \times \Omega_2, \mu_1 \otimes \mu_2)$ and let $T_i : \Omega_i \to \Omega_i$ be ergodic transformations. For every $f \in L \log L(\Omega,\mu)$, the two-parameter ergodic averages
$$
A_{n,m}(f) = \frac{1}{n m} \sum_{i = 0}^{n - 1} \sum_{j = 0}^{m - 1} f(T_1^i x, T_2^j y)
$$
converge almost everywhere as $n, m \to \infty$.
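As a quick numerical sanity check (my own illustration, not part of the argument), one can watch the normalized two-parameter averages converge for a simple product system: an irrational rotation on each factor of the two-torus. The function $f$, the rotation angles, and the base point below are all chosen purely for illustration; $f$ has integral zero, so the averages should tend to $0$.

```python
import numpy as np

# Illustrative two-parameter ergodic averages for the rotations
# T_1(x) = x + alpha1 (mod 1), T_2(y) = y + alpha2 (mod 1), with
# f(x, y) = cos(2*pi*x) * cos(2*pi*y), whose integral over the torus is 0.

alpha1, alpha2 = np.sqrt(2) % 1, np.sqrt(3) % 1  # irrational rotation angles
x0, y0 = 0.1234, 0.5678                          # a generic starting point

def A(n, m):
    """Normalized average (1/(n*m)) * sum_{i<n, j<m} f(T_1^i x0, T_2^j y0)."""
    xs = (x0 + alpha1 * np.arange(n)) % 1.0
    ys = (y0 + alpha2 * np.arange(m)) % 1.0
    # f factorizes, so the double sum factorizes into a product of means.
    return np.cos(2 * np.pi * xs).mean() * np.cos(2 * np.pi * ys).mean()

for n in [10, 100, 1000, 10000]:
    print(n, A(n, n))
```

The values shrink toward the integral of $f$, which is $0$, as the theorem predicts for this (very regular) example.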
The proof uses the following steps:
- For each $i$, use the weak-type $(1,1)$ maximal ergodic inequality for the one-parameter averages $A^i_n$ taken in the $i$-th variable.
- Use real interpolation to get that
$$
\Big\| \sup_{n \geq 0} A^i_{n}(|f|) \Big\|_p \lesssim \max \Big\{ 1, \frac1{p - 1} \Big\} \, \| f \|_p.
$$
- Use Yano's extrapolation [Y] to obtain that the maximal operator above maps $L \log L(\Omega)$ into $L^1(\Omega)$ for every $i$.
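Schematically, the extrapolation step can be written out as follows (my own sketch of the standard statement, with unspecified constants $C, C'$):

```latex
% Input: the interpolated maximal bound, with constant blowing up as p -> 1+
\[
  \Big\| \sup_{n \geq 0} A^i_{n}(|f|) \Big\|_{p} \le \frac{C}{p-1} \, \| f \|_{p},
  \qquad 1 < p \le 2.
\]
% Yano's extrapolation theorem [Y] then yields the L log L -> L^1 bound
\[
  \Big\| \sup_{n \geq 0} A^i_{n}(|f|) \Big\|_{L^1(\Omega)}
  \le C' \Big( 1 + \int_\Omega |f| \log^+ |f| \, d\mu \Big),
\]
% i.e. the maximal operator maps L log L(Omega) into L^1(Omega).
```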
Then, copying the argument in [JMZ], you have that the map $f \mapsto f^\ast$ given by
$$
f^\ast(x,y) = \limsup_{n, m \to \infty} A_{n,m}(|f|)
$$
maps $L \log L(\Omega)$ into $L^1(\Omega)$. Just by composition, we have that
$$
\limsup_{n, m \to \infty} A_{n,m}(|f|)
\leq
\limsup_{n \to \infty} A^1_{n} \bigg( \underbrace{\sup_{m \geq 0} A_m^2|f|}_{g} \bigg)
$$
But the function $g$ is in $L^1(\Omega)$, since $f$ is in $L \log L$. Now, using that the one-parameter averages of every function in $L^1(\Omega)$ converge almost everywhere (and therefore the $\limsup$ can be replaced by the $\lim$), we can conclude. The fact that $f^\ast$ is in $L^1$, and in particular finite almost everywhere, gives almost everywhere convergence by the same argument that is used with maximal functions.
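For completeness, here is a schematic version (my own sketch) of the oscillation argument hiding behind "the same argument that is used with maximal functions":

```latex
% Define the oscillation of the two-parameter averages:
\[
  \mathrm{Osc}(f) := \limsup_{n,m\to\infty} A_{n,m}(f)
                   - \liminf_{n,m\to\infty} A_{n,m}(f).
\]
% If h belongs to a dense subclass on which A_{n,m}(h) converges a.e.
% (e.g. bounded functions), then Osc(f) = Osc(f - h) <= 2 (f - h)^*, and
% since g -> g^* maps L log L(Omega) boundedly into L^1(Omega),
\[
  \| \mathrm{Osc}(f) \|_{1} \le 2\, \| (f-h)^\ast \|_{1}
  \lesssim \| f - h \|_{L\log L} \longrightarrow 0
  \quad \text{as } h \to f,
\]
% so Osc(f) = 0 almost everywhere, i.e. A_{n,m}(f) converges a.e.
```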
I will go further and conjecture that the following is true (probably known to the experts):
Open (to my knowledge): For every $\varphi$ with $\varphi(x) = o(x \log x)$ there is an element $f \in L_\varphi(\Omega)$ such that the averages $A_{n,m}(f)$ do not converge almost everywhere.
It is known, see [JMZ, Theorem 8], that this holds in the case of differentiation of integrals. I will try to adapt the argument to the ergodic case. The conjecture is also likely to be true, since something similar holds for martingales [G].
[G] Gundy, R. F., On the class L log L, martingales, and singular integrals, Stud. Math. 33, 109-118 (1969). ZBL0181.44202.
[JMZ] Jessen, B.; Marcinkiewicz, J.; Zygmund, A., Note on the differentiability of multiple integrals., Fundamenta Math. 25, 217-234 (1935). ZBL61.0255.01.
[Y] Yano, Shigeki, Notes on Fourier analysis. XXIX. An extrapolation theorem, J. Math. Soc. Japan 3, 296-305 (1951). ZBL0045.17901.
P.S.: For your purposes, perhaps it will be enough to take the usual ergodic averages with respect to
$$
S = p (\mathrm{id} \otimes T) + q (T \otimes \mathrm{id})
$$
for $p + q = 1$. Its ergodic averages will be weighted sums whose weights concentrate, Gaussian-like, around the diagonal $i = j$.
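To see the concentration claim concretely: since $\mathrm{id} \otimes T$ and $T \otimes \mathrm{id}$ commute, the binomial theorem gives $S^N = \sum_{k=0}^{N} \binom{N}{k} q^k p^{N-k} \, (T^k \otimes T^{N-k})$, so the weights along the anti-diagonal $i + j = N$ are binomial and concentrate, with width of order $\sqrt{N}$, around $(i,j) \approx (qN, pN)$. A small numerical check (the values of $N$ and $q$ are illustrative):

```python
import math

# S = p*(id ⊗ T) + q*(T ⊗ id) with p + q = 1. Expanding S^N binomially
# (the two factors commute) gives weights C(N,k) q^k p^(N-k) on the pairs
# (i, j) = (k, N - k). These weights concentrate, like a Gaussian of width
# ~ sqrt(N), around k ≈ q*N.

def binomial_weights(N, q):
    return [math.comb(N, k) * q**k * (1 - q)**(N - k) for k in range(N + 1)]

N, q = 1000, 0.5
w = binomial_weights(N, q)

# Mass within 3 standard deviations (sigma = sqrt(N*q*(1-q))) of the center:
sigma = math.sqrt(N * q * (1 - q))
center = q * N
mass = sum(wk for k, wk in enumerate(w) if abs(k - center) <= 3 * sigma)
print(f"weight within 3 sigma of the center: {mass:.4f}")
```

So essentially all of the weight of $S^N$ sits in a band of width $O(\sqrt{N})$ around the diagonal, as claimed.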
Edit: I found an (almost) complete solution in $L^1$ in the following paper:
You need to interpret $(i,j) \mapsto T^i \otimes T^j$ as an action of $\mathbb{Z}^2$ and use that $[0,N]^2$ is a well-tempered Følner sequence (in the sense of that paper). That would give you almost everywhere convergence of the means $A_{n,n}(f)$ for every $f \in L^1(\Omega)$.
The issue is that you have the incorrect probability space. Remember, your space $\Omega$ needs to be rich enough to deal with sample paths, not just sample points. For example, $(X_1,X_2,X_3) = (0,0,0)$, $(0,1,0)$, and $(0,1,1)$ should all have positive probability for the Markov chain you have defined, and yet your probability space only has two points! The correct probability space to consider is
$$\Omega :=\{0,1\}^{\mathbb Z}$$
with the product $\sigma$-field, and the unique probability measure $\mathbb P$ satisfying
$$\mathbb P\Big(\big\{\omega=\{\omega_i\}_{i\in\mathbb Z}\in\Omega:\omega_i = x_i\text{ for }i\in\{n,n+1,\ldots,n+m\}\big\}\Big) = \mu(x_n)\prod_{i=n+1}^{n+m}p(x_{i-1},x_i)$$
for each $n\in\mathbb Z$, $m\in\mathbb N$, and $(x_n,\ldots,x_{n+m})\in\{0,1\}^{m+1}$, where $\mu = \frac12\delta_{0} + \frac12\delta_{1}$ is the invariant distribution of your Markov chain. Your measure preserving transformation $\phi$ is just the shift operator, i.e.
$$\phi\Big(\{x_i\}_{i\in\mathbb Z}\Big) = \{y_i\}_{i\in\mathbb Z} \quad \text{where }y_i=x_{i+1}$$
and your observable $X$ is just the value of the $0$th component of your sample point $\omega$, i.e.
$$X\Big(\{\omega_i\}_{i\in\mathbb Z}\Big) := \omega_0.$$
Under this notation, $X_n(\omega)=\omega_n$, so $\{X_n\}$ is a Markov chain with initial distribution $\mu$ and transition probabilities given by $p$. By construction, this is an ergodic Markov chain, and $\phi$ is an ergodic transformation.
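Here is a small simulation (with assumed symmetric transition probabilities $p(0,1) = p(1,0) = a$, which make $\mu = \frac12\delta_{0} + \frac12\delta_{1}$ invariant) checking numerically that the Birkhoff averages of the observable $X$ along the shift converge to the space average $\mathbb{E}_\mu[X] = \frac12$, as the ergodic theorem predicts:

```python
import random

# Toy check of Birkhoff's theorem for the two-state Markov shift.
# Assumed kernel: p(0,1) = p(1,0) = a (symmetric), so mu = (1/2, 1/2)
# is invariant. Then (1/n) * sum_{k<n} X(phi^k omega) -> E_mu[X] = 1/2 a.e.

random.seed(0)
a = 0.3  # flip probability (illustrative value)

def sample_path(n):
    x = random.randint(0, 1)      # start from the invariant distribution mu
    path = [x]
    for _ in range(n - 1):
        if random.random() < a:   # flip state with probability a
            x = 1 - x
        path.append(x)
    return path

n = 200_000
omega = sample_path(n)
time_average = sum(omega) / n     # (1/n) * sum_{k<n} X_k(omega)
print(f"time average = {time_average:.4f}  (E_mu[X] = 0.5)")
```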
More generally, if you replace $\{0,1\}$ with a countable state space $E$ and let $p$ be a transition kernel on $E$ with invariant distribution $\mu$, then the measure preserving transformation $\phi$ will be ergodic if and only if the induced Markov chain is ergodic, so the (seemingly different) definitions are consistent.
When you pour a glop of syrup into your oatmeal, the contents of the bowl can be modelled as a set $X$ broken into a disjoint union of two measurable subsets $X = S \sqcup O$: $S$ represents the glop of syrup; and $O$ represents the original oatmeal.
Letting $\chi_S,\chi_O : X \to \{0,1\}$ be the characteristic functions, we have $\chi_S(x)+\chi_O(x)=1$ for all $x \in X$.
The volume measure on $X$ can be modelled as Lebesgue measure $\mu$ restricted to $X$, and normalized so that $\mu(X)=1$. You could then express both the syrup and the oatmeal as densities $\chi_S d\mu$, $\chi_O d\mu$.
The transformation itself is being modelled (somewhat unrealistically) as a measure preserving bijection $T : X \to X$.
I do not know what you mean by $f$; it is not mentioned elsewhere in your post. But what I'll say is that $\mu$ itself is invariant under the transformation $T$, so $\mu(A) = \mu(T(A))$ for all measurable $A$, which can be written as $\int_A d\mu = \int_{T(A)} d\mu$.
One key point of ergodicity, as applied to this "syrup/oatmeal" picture, is that neither $S$ nor $O$ is invariant under $T$: neither $\mu(T^{-1}(S) \Delta S)$ nor $\mu(T^{-1}(O) \Delta O)$ is zero. To put it another way, neither of the two characteristic functions $\chi_S$, $\chi_O$ is invariant under $T$.
Nonetheless, the real key point is that, as one iterates $T$ more and more (as one stirs the oatmeal more and more), the function $\chi_S \circ T^{-n}$ has a limit in some appropriate sense, that limit is $T$-invariant, and that limit is actually the constant function $\mu(S)$ with associated distribution $\mu(S)\, d\mu$ ($T$ will "distribute the syrup evenly throughout"). For this to work one has to take the limit of the function sequence $\chi_S \circ T^{-n}$ rather carefully, usually in $L^2(X)$ or $L^1(X)$.
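To make this concrete with a toy computation, one can play the role of $T$ with the baker's map, a standard measure-preserving bijection of the unit square that mixes (my own illustrative choice, not part of the answer above). Tracking particles sampled from the syrup region $S = \{x < \tfrac12\}$, the fraction found in a test region $A$ after many stirs approaches $\mu(A)$, i.e. the syrup ends up spread evenly:

```python
import random

random.seed(1)

# Baker's map on the unit square: stretch horizontally by 2, cut, and
# stack. It is an invertible, measure-preserving, mixing transformation.
def baker(x, y):
    if x < 0.5:
        return 2 * x, y / 2
    return 2 * x - 1, (y + 1) / 2

N, n = 100_000, 20
in_A = 0
for _ in range(N):
    # Sample a syrup particle uniformly from S = [0, 1/2) x [0, 1).
    x, y = random.random() / 2, random.random()
    for _ in range(n):
        x, y = baker(x, y)
    if x < 0.5 and y < 0.5:   # test region A = [0, 1/2) x [0, 1/2)
        in_A += 1

# By mixing, the syrup fraction inside A tends to mu(A) = 1/4.
print(f"syrup fraction in A after {n} stirs: {in_A / N:.4f}")
```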
By the way, I have written my answer under the assumption that $T$ is a bijection, which allows me to be careless and write things like $\chi_S \circ T^{-n}$. One can certainly be more careful and rewrite all of this in the general case that $T$ is not a bijection.