Definition of Ergodicity in Theodoridis’ Machine Learning

ergodic-theory, machine-learning, probability, statistics, stochastic-processes

This is related to, but not the same as, https://stats.stackexchange.com/questions/319190/wide-sense-stationary-but-not-ergodic.

Note that I am not assuming stationarity. Theodoridis, in his Machine Learning (2nd ed., p. 44), states the following:

Definition 2.3 (Ergodicity). A stochastic process is said to be ergodic if the complete statistics can be determined by any one of the realizations.

Theodoridis then goes on to state that ergodicity implies stationarity, with a vague explanation that I can't quite parse:

In other words, if a process is ergodic, every single realization carries identical statistical information and it can describe the entire random process. Since from a single sequence only one set of PDFs [probability density functions] can be obtained, we conclude that every ergodic process is necessarily stationary.

What does ergodicity mean in this situation? In particular, the phrase "complete statistics" is throwing me off. I get the feeling that Theodoridis isn't referring to the usual definition of a complete statistic $T$ for $\theta$, namely
$$\mathbb{E}[g(T) \mid \theta] = 0 \text{ for all } \theta \implies \mathbb{P}(g(T) = 0 \mid \theta) = 1 \text{ for all } \theta.$$

I recently completed a measure-theoretic probability sequence, so if this requires measure-theoretic probability tools to make precise, please use those tools in answering this question.

Some searching turns up statements that the "time average is equal to the ensemble average," but I'm not really understanding this explanation or how it relates to the definition in Theodoridis' text.
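For concreteness, the statements I keep finding look like the following (and they all assume stationarity): for a stationary ergodic process $\{X_n\}$,
$$\frac{1}{N}\sum_{n=1}^{N} X_n \xrightarrow[N\to\infty]{\text{a.s.}} \mathbb{E}[X_1],$$
i.e., the time average computed along a single realization converges to the ensemble average.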

Sources I could use to supplement this with more detailed coverage of the theory would be appreciated as well. Hamilton's Time Series Analysis (like most of my other texts and most coverage I find online) only covers ergodicity under the assumption of stationarity to begin with, and Robert and Casella's Monte Carlo Statistical Methods only discusses ergodicity in the context of Markov chains.

Best Answer

This is simply a quite careless definition of ergodicity!

Here "complete" is used a loose way to mean that you can determine all that there is to determine about some statistics in question.

Surely, the author is talking about a specific statistic; indeed, on the same page he introduces $\hat \mu_N$ and writes "where the sample mean operation can be significantly simplified".
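(For readers without the book at hand: $\hat \mu_N$ denotes the sample-mean estimator computed along a single realization, i.e., something of the form
$$\hat{\mu}_N = \frac{1}{N}\sum_{n=1}^{N} x_n,$$
up to the book's indexing conventions.)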

Why is it quite careless, then?

Because that is not how it works: we do not get the same value for every finite $N$, as the author's wording suggests. We can only approximate the ensemble mean by taking $N$ large, and what ergodicity guarantees is that this approximation converges to the same value for (almost) every realization, that is, for any particular sequence of observed values (samples).
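Here is a minimal numerical sketch of that point (my own, not from the book), in Python/NumPy. It contrasts an i.i.d. (hence ergodic) process, whose time average converges to the ensemble mean for every realization, with the classic stationary-but-not-ergodic process $X_n = Z$ for all $n$, whose time average equals the realized $Z$ rather than $\mathbb{E}[X_n]$:

```python
import numpy as np

N = 100_000  # length of each simulated realization

# Ergodic case: i.i.d. N(0, 1) noise. For every realization, the time
# average (1/N) * sum(x_n) converges to the ensemble mean E[X_n] = 0.
for seed in range(3):
    x = np.random.default_rng(seed).standard_normal(N)
    print(f"iid realization {seed}: time average = {x.mean():+.4f}")

# Stationary but NOT ergodic case: draw Z ~ N(0, 1) once and set
# X_n = Z for all n. The time average of each realization is its own
# realized Z, not the ensemble mean 0, so no single realization can
# recover the ensemble statistics.
for seed in range(3):
    z = np.random.default_rng(seed).standard_normal()
    x = np.full(N, z)
    print(f"constant realization {seed}: time average = {x.mean():+.4f}")
```

The first loop prints values near $0$ for every seed; the second prints three different values, one per realization.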

In more mathematical terms, this is indeed a (failed) attempt at simplifying the big picture of ergodic theory, in which ergodicity is equivalent to the equality of time averages and space (ensemble) averages.
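Since you asked for measure-theoretic tools: let $T$ be a measure-preserving transformation of a probability space $(\Omega, \mathcal{F}, \mu)$. The system is called ergodic if every $T$-invariant set $A$ (i.e., $T^{-1}A = A$) satisfies $\mu(A) \in \{0, 1\}$. Birkhoff's pointwise ergodic theorem then says that for an ergodic system and every $f \in L^1(\mu)$,
$$\lim_{N\to\infty} \frac{1}{N}\sum_{n=0}^{N-1} f(T^n \omega) = \int_\Omega f \, d\mu \quad \text{for } \mu\text{-almost every } \omega,$$
i.e., the time average along almost every orbit equals the space average. Applying this to a stationary process via its shift map, with $f$ ranging over suitable coordinate functions, recovers exactly the "time average equals ensemble average" statement you found. A standard reference for this material is Walters, An Introduction to Ergodic Theory.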