The transformation $\theta$ on $\Omega^{\Bbb N}$ is ergodic. Indeed, it is enough to show that for all cylinders $A$ and $B$, we have
$$\frac 1n\sum_{k=0}^{n-1}\mu(\theta^{-k}A\cap B)\to \mu(A)\mu(B),$$
where $\mu$ is the measure on the product $\sigma$-algebra.
If $A=\prod_{j=0}^N A_j\times \Omega\times\dots$ and $B=\prod_{j=0}^N B_j\times \Omega\times\dots$, then for $k>N$ we have
\begin{align}
\theta^{-k}A\cap B&=\{(x_j)_{j\geq 0} : (x_{j+k})_{j\geq 0}\in A \text{ and } (x_j)_{j\geq 0}\in B\}\\
&=\{(x_j)_{j\geq 0} : x_{j+k}\in A_j \text{ and } x_j\in B_j \text{ for } 0\leq j\leq N\}\\
&=B_0\times \dots\times B_N\times \Omega\times\dots\times \Omega\times A_0\times\dots\times A_N\times \Omega\times\dots,
\end{align}
and by the definition of the product measure $\mu$ on cylinders, $\mu(\theta^{-k}A\cap B)=\mu(A)\mu(B)$ for every $k>N$, so the Cesàro averages converge to $\mu(A)\mu(B)$ (the finitely many terms with $k\leq N$ do not affect the limit).
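For concreteness, here is a small Monte Carlo sketch of this computation (my own illustration, assuming $\Omega=\{0,1\}$ with the fair-coin product measure and the specific cylinders named in the comments):

```python
import random

random.seed(0)

# Hypothetical check on Omega = {0,1} with the fair-coin product measure.
# Cylinders: A = {x : x_0 = 1, x_1 = 1} and B = {x : x_0 = 0}, so N = 1,
# mu(A) = 1/4 and mu(B) = 1/2.
def in_A(x, shift=0):
    return x[shift] == 1 and x[shift + 1] == 1

def in_B(x):
    return x[0] == 0

trials, n = 100_000, 10
hits = [0] * n                      # hits[k] counts samples in theta^{-k}A ∩ B
for _ in range(trials):
    x = [random.randint(0, 1) for _ in range(n + 2)]
    for k in range(n):
        if in_A(x, shift=k) and in_B(x):
            hits[k] += 1

est = [h / trials for h in hits]
# k = 0 is impossible (x_0 cannot be both 0 and 1); for k >= 1 the coordinate
# blocks are disjoint here, so each estimate is near mu(A)*mu(B) = 0.125.
print([round(e, 3) for e in est])
```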
Since $\theta$ is ergodic, $\mathcal J_{\theta}$ consists only of events of measure $0$ or $1$. The conditional expectation with respect to such a $\sigma$-algebra is necessarily almost surely constant.
Your question is about intuition, so I will try to answer it through one very intuitive example.
Example:
Dynamics:
You have some money, say \$100, and we're playing a game with a fair coin. Each time the coin lands heads (H), we multiply your wealth by 1.5; each time it lands tails (T), we multiply it by 0.6.
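These dynamics are easy to simulate; here is a minimal sketch (function name and defaults are my own):

```python
import random

random.seed(1)

def wealth_path(w0=100.0, steps=1000, p_heads=0.5, up=1.5, down=0.6):
    """One realization of the game: heads multiplies wealth by 1.5, tails by 0.6."""
    w, path = w0, [w0]
    for _ in range(steps):
        w *= up if random.random() < p_heads else down
        path.append(w)
    return path

path = wealth_path()
# A single long trajectory typically decays: the per-step growth factor is
# (1.5 * 0.6) ** 0.5 ≈ 0.95 < 1, even though the mean factor 1.05 exceeds 1.
print(path[-1])
```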
Averages:
Let $W(t)$ denote your wealth at time $t$. Let this process run for a finite time $T$ and take the average of your wealth over that time. This is the finite time-average:
$\left\langle W(t)\right\rangle _{T} = \frac{1}{T+1}\sum_{t=0}^{T}W(t)$
In contrast, assume that we have $N$ of these processes, let each run until time $t$, and then average over the $N$ of them. This is the finite ensemble-average of the $N$ wealth processes:
$\left\langle W(t)\right\rangle _{N} = \frac{1}{N}\sum_{i=1}^{N}W_i(t)$
where $i$ denotes the $i$th of the $N$ processes and the average is taken at a fixed time $t$. Letting $T\rightarrow\infty$ gives the time average, and letting $N\rightarrow\infty$ gives the ensemble average (the expectation operator).
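Both finite averages can be computed directly; a sketch under the coin-game dynamics above (the particular values of $T$, $N$, and $t$ are my own choices):

```python
import random

random.seed(2)

def step(w, up=1.5, down=0.6):
    return w * (up if random.random() < 0.5 else down)

# Finite time-average <W>_T of one trajectory
T = 10_000
w, total = 100.0, 0.0
for _ in range(T):
    w = step(w)
    total += w
time_avg = total / T

# Finite ensemble-average <W>_N at a fixed time t
N, t = 10_000, 10
ensemble_avg = 0.0
for _ in range(N):
    w = 100.0
    for _ in range(t):
        w = step(w)
    ensemble_avg += w / N

# ensemble_avg should sit near 100 * 1.05**10 ≈ 163; the time average of a
# single (typically decaying) trajectory is far smaller.
print(time_avg, ensemble_avg)
```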
Now that we have introduced the relevant dynamics (the coin game) and the two averages, let's restate your definition of ergodicity, namely that the time average equals the ensemble average.
1) You are correct in asserting that each of the $N$ coin-tossing sequences has a time-average. However, in the limit, all of those $N$ time averages are equal, precisely because they are governed by the same dynamic, and they all converge to zero with probability one. If you find that unintuitive, try simulating a bunch of these processes and plotting their distribution. What you will get is approximately log-normal and extremely skewed: e.g., there will be a one-in-a-million trajectory whose wealth grows exponentially, while the bulk of the ensemble will be close to zero.
2) Is the above wealth-generating process ergodic? No. The ensemble average predicts unbounded positive growth$^*$, while the time average converges to zero.$^{**}$ This is essentially the mechanism behind the St. Petersburg paradox. Unrelated to your question, but both interesting and important: it is possible to construct an ergodic observable using the logarithm, which resolves the St. Petersburg paradox.
Hope you can use this. If you want to see the formal proofs and simulations, I can recommend:
$^{*}$ Ensemble average: $\frac12\times0.6+\frac12\times1.5=1.05$, a per-step growth factor larger than one, reflecting positive growth of the ensemble.
$^{**}$ Time average: $x(t)=r_1^{n_1}r_2^{n_2}$, where $r_1$ and $r_2$ are the two rates and $n_1$ and $n_2$ are the frequencies with which the wealth process is subjected to each rate. Since $n_1/t, n_2/t\to\frac12$ as $t\rightarrow\infty$, the per-step growth factor satisfies $x(t)^{1/t} \to (r_1r_2)^{1/2}=\sqrt{0.9}\approx0.95$, a number less than one, i.e. decay in the long-time limit.
Video
Article
A stochastic process is a collection $\{X_t\}_t$ of real-valued (for this discussion) random variables defined on a common probability space $(\Omega,\mathcal{F},P)$. For this discussion, we'll have the indexing be over the natural numbers $\mathbb{N}$: $(X_n)_{n \ge 1}$. For the definitions we consider, we insist that each $X_n$ has the same mean, denoted $E[X]$. A process $(X_n)_n$ is stationary if, for any $m \ge 1$ and any Borel sets $A_1,\dots,A_m$, it holds that $P(X_1 \in A_1, \dots, X_m \in A_m) = P(X_2 \in A_1,\dots,X_{m+1} \in A_m)$; in other words, $(X_n)_n$ is stationary if the law of $(X_1,X_2,X_3,\dots)$ is the same as the law of $(X_2,X_3,\dots)$ (where the value space is $\mathbb{R}^\mathbb{N}$ with product sigma-algebra).
We say that $(X_n)_n$ is ergodic if $(X_n)_n$ is stationary and $(\mathbb{R}^\mathbb{N},\mathcal{B}^\mathbb{N},P^\mathbb{N},T)$ is ergodic as a measure preserving system, where $P^\mathbb{N}(A) := P((X_1(\omega),X_2(\omega),\dots) \in A)$ and $T$ is the left shift: $T((x_1,x_2,x_3,\dots)) := (x_2,x_3,\dots)$ (note that $(X_n)_n$ stationary implies $T$ preserves $P^\mathbb{N}$). A consequence of this notion of ergodicity is that for each measurable $f: \mathbb{R} \to \mathbb{R}$, it holds that $\frac{1}{N}\sum_{n \le N} f(X_n(\omega)) \to E[f(X)]$ with probability $1$.
We say that $(X_n)_n$ is mean-ergodic if $E\left[|\frac{1}{N}\sum_{n \le N} X_n - E[X]|^2\right] \to 0$ as $N \to \infty$. Note that $(X_n)_n$ need not be stationary.
We say that $(X_n)_n$ is pointwise-ergodic if $\frac{1}{N}\sum_{n \le N} X_n(\omega) \to E[X]$ with probability $1$. Note that ergodic implies pointwise-ergodic.
We say that $(X_n)_n$ is LLN-applicable if the $X_n$'s are i.i.d. and the set of $\omega \in \Omega$ satisfying $\frac{1}{N}\sum_{n \le N} X_n(\omega) \to E[X]$ has measure/probability $1$.
It is not clear what is usually meant by "ergodic". Wikipedia uses "ergodic" to mean mean-ergodic.
As the comments indicate, there are $(X_n)_n$ that are pointwise-ergodic but not i.i.d. and thus not LLN-applicable. For example, we can take any $Y_n$'s that are i.i.d. and then consider $(X_1,X_2,\dots) = (Y_1,Y_1,Y_2,Y_3,\dots)$.
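A quick numerical illustration of this duplicated process (my own sketch, using fair $\pm 1$ signs for the $Y_n$'s):

```python
import random

random.seed(4)

# Hypothetical demo: (X_1, X_2, X_3, ...) = (Y_1, Y_1, Y_2, Y_3, ...) with
# i.i.d. fair signs Y_i. The process is not i.i.d. (X_1 = X_2 always), yet
# its running average still converges to E[Y] = 0.
N = 100_000
Y = [random.choice([-1.0, 1.0]) for _ in range(N)]
X = [Y[0]] + Y[:N - 1]              # first N terms of the duplicated process

avg = sum(X) / N
print(round(avg, 4))                # close to 0
```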
This answer gives an example of $(X_n)_n$ that are stationary, pointwise-ergodic, but not ergodic. The example is also mean-ergodic if $E[Y_i] = 0$ and $E[Y_i^2] < \infty$.
Let $Y_1,Y_2,\dots$ be independent with, for each $i$, $P(Y_i = 1) = \frac{1}{2}$ and $P(Y_i = -1) = \frac{1}{2}$. Let $X_n = \frac{Y_1+\dots+Y_n}{\sqrt{n\log\log\log n}}$. Then indeed $E[X_n]$ is the same for each $n$ (the common mean is $0$) and $(X_n)_n$ is mean-ergodic, since $E[|X_n-0|^2] = E[X_n^2] = \frac{n}{n\log\log\log n} \to 0$. However, $(X_n)_n$ is not pointwise-ergodic: by the law of the iterated logarithm, with probability $1$, $\limsup_{n \to \infty} \frac{Y_1+\dots+Y_n}{\sqrt{2n\log\log n}} = 1$, which forces $\limsup_{n} X_n = +\infty$ almost surely. (It is obviously not ergodic, since it is not stationary.)
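A quick numerical check of the second-moment computation (my own sketch; note that $1/\log\log\log n$ tends to $0$ extremely slowly, so at moderate $n$ the value is still above $1$):

```python
import math, random

random.seed(5)

def sample_X(n):
    """One draw of X_n = (Y_1 + ... + Y_n) / sqrt(n * logloglog n)."""
    s = sum(random.choice([-1, 1]) for _ in range(n))
    return s / math.sqrt(n * math.log(math.log(math.log(n))))

n, trials = 2_000, 1_000
second_moment = sum(sample_X(n) ** 2 for _ in range(trials)) / trials
theory = 1 / math.log(math.log(math.log(n)))   # = E[X_n^2], tends to 0 very slowly

print(round(second_moment, 2), round(theory, 2))
```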
If the $X_n$'s are bounded, then the dominated convergence theorem implies that pointwise-ergodic implies mean-ergodic.
A Markov chain, for this discussion, is a stochastic process $(X_n)_{n \ge 1}$ taking values in a common, finite set $X$ such that (1) $P(X_n=x) > 0$ for each $n \ge 1$ and $x \in X$ and (2) for each $n,x_1,\dots,x_n$, it holds that $P(X_n = x_n | X_{n-1} = x_{n-1},\dots,X_1=x_1) = P(X_n = x_n | X_{n-1} = x_{n-1})$. We will make the common assumption/imposition that the Markov chain is time homogeneous, meaning $P(X_n = x_n | X_{n-1} = x_{n-1})$ is independent of $n$ and also that $P(X_1 = x) = P(X_2 = x)$ for each $x \in X$. An easy exercise shows that the time homogeneity assumption implies that $(X_1,X_2,\dots)$ is stationary (hint: use induction).
Associated to a Markov chain, we can define a weighted, directed graph on $X$ with the weight of the directed edge from $x$ to $y$ being $P(X_2 = y | X_1 = x)$ if the probability is positive (if it's not, don't draw a directed edge).
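As a small illustration (a hypothetical 3-state chain of my own), the edge set of that graph can be read off a transition table:

```python
# Hypothetical 3-state chain on X = {"a", "b", "c"}; row x holds P(X_2 = y | X_1 = x).
P = {
    "a": {"a": 0.5, "b": 0.5, "c": 0.0},
    "b": {"a": 0.0, "b": 0.2, "c": 0.8},
    "c": {"a": 1.0, "b": 0.0, "c": 0.0},
}

# Keep the weighted directed edge x -> y only when the transition probability
# is positive, exactly as described above.
edges = {(x, y): p for x, row in P.items() for y, p in row.items() if p > 0}
print(sorted(edges))    # [('a', 'a'), ('a', 'b'), ('b', 'b'), ('b', 'c'), ('c', 'a')]
```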
A Markov chain is strongly connected if one can get from any vertex to any other vertex by travelling along the directed edges of the associated graph.
$\bf{Claim}$: A Markov chain is ergodic if and only if it is strongly connected.
Proof: Suppose the Markov chain is ergodic. Take any $x_0,x_1 \in X$ and let $f = 1_{\{x_1\}}$. Then, with probability $1$, $\frac{1}{N}\sum_{n \le N} f(X_n(\omega)) \to E[f(X)] = P(X_1 = x_1) > 0$ (by assumption (1)), so almost every trajectory visits $x_1$; in particular, $P(\exists n \ge 2 : X_n = x_1 \mid X_1 = x_0) > 0$. By countable additivity, there is some $n \ge 2$ with $P(X_n = x_1 \mid X_1 = x_0) > 0$. But
$$P(X_n = x_1 \mid X_1 = x_0) = \sum_{y_2,\dots,y_{n-1}} P(X_n = x_1 \mid X_{n-1} = y_{n-1})\cdots P(X_2 = y_2 \mid X_1 = x_0),$$
so there are some $y_2,\dots,y_{n-1}$ for which every factor in the product is nonzero, i.e. there is a directed path from $x_0$ to $x_1$. Therefore, the Markov chain is strongly connected. The other direction uses Perron-Frobenius and some ergodic theory, which I omit.
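Strong connectivity itself can be checked mechanically with a breadth-first search over the positive-probability edges. A minimal sketch (the two example chains are my own, and only the graph structure is used, not the distributional assumptions in the definition above):

```python
from collections import deque

def reachable(start, adj):
    """All states reachable from `start` along positive-probability edges."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v, p in adj[u].items():
            if p > 0 and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def strongly_connected(adj):
    states = set(adj)
    return all(reachable(x, adj) == states for x in adj)

# Hypothetical chains: one with an absorbing state, one a deterministic cycle.
absorbing = {"a": {"b": 1.0}, "b": {"a": 0.5, "c": 0.5}, "c": {"c": 1.0}}
cycle = {"a": {"b": 1.0}, "b": {"c": 1.0}, "c": {"a": 1.0}}

print(strongly_connected(absorbing), strongly_connected(cycle))  # False True
```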