My impression is that this question does not have a unique, fully general answer, so I will explore only the simplest case, and in a somewhat informal way.
Assume that the true Data Generating Mechanism is
$$y_t = y_{t-1} + u_t,\;\; t=1,...,T,\;\; y_0 =0 \tag{1}$$
with $u_t$ the usual zero-mean i.i.d. white-noise component, $E(u_t^2)= \sigma^2_u$. The above also implies that
$$y_t = \sum_{i=1}^tu_i \tag{2}$$
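As a concrete illustration, the recursion (1) and the cumulative sum (2) can be checked to agree numerically; this sketch assumes Gaussian noise, though the argument only needs zero-mean i.i.d.:

```python
import numpy as np

# Simulate the DGP in (1): y_t = y_{t-1} + u_t, y_0 = 0,
# with i.i.d. zero-mean noise u_t. T and sigma_u are illustrative choices.
rng = np.random.default_rng(0)
T, sigma_u = 200, 1.0
u = rng.normal(0.0, sigma_u, size=T)

# Recursive construction, as in (1)
y_rec = np.zeros(T + 1)
for t in range(1, T + 1):
    y_rec[t] = y_rec[t - 1] + u[t - 1]

# Cumulative-sum construction, as in (2)
y_cum = np.concatenate([[0.0], np.cumsum(u)])

assert np.allclose(y_rec, y_cum)  # (1) and (2) describe the same path
```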
We specify a model, call it model $A$
$$y_t = \beta y_{t-1} + u_t,\;\; t=1,...,T,\;\; y_0 =0 \tag{3}$$
and we obtain an estimate $\hat \beta$ for the postulated $\beta$ (let's discuss the estimation method only if need arises).
So the $k$-step-ahead prediction will be
$$\hat y_{T+k} = \hat \beta^k y_T \tag{4}$$
and its MSE will be
$$MSE_A[\hat y_{T+k}] = E\left(\hat \beta^k y_T-y_{T+k}\right)^2 $$
$$=E\left[(\hat \beta^k-1) y_T -\sum_{i=T+1}^{T+k}u_i \right]^2 = E\big[(\hat\beta^k-1)^2 y_T^2\big]+ k\sigma^2_u \tag{5}$$
(the middle term of the square vanishes because $\hat \beta$ and $y_T$ are independent of the future errors, and the cross-products among the future errors have zero expectation).
Let's now say that we have differenced our data, and specified a model $B$
$$\Delta y_t = \gamma \Delta y_{t-1} + u_t \tag{6}$$
and obtained an estimate $\hat \gamma$. Our differenced model can be written
$$y_t = y_{t-1} + \gamma (y_{t-1}-y_{t-2}) + u_t \tag{7}$$
so forecasting the level of the process, we will have
$$\hat y_{T+1} = y_{T} + \hat \gamma (y_{T}-y_{T-1})$$
which, given the true DGP, will in reality be
$$\hat y_{T+1} = y_{T} + \hat \gamma u_T \tag {8}$$
It is easy to verify then that, for model $B$,
$$\hat y_{T+k} = y_{T} + \big(\hat \gamma + \hat \gamma^2+...+\hat \gamma^k)u_T $$
Now, we reasonably expect that, given any "tested and tried" estimation procedure, we will obtain $|\hat \gamma|<1$, since its true value is $0$, unless we have very few observations or data in very "bad" shape. So we can say that in most cases we will have
$$\hat y_{T+k} = y_{T} + \frac {\hat \gamma - \hat \gamma ^{k+1}}{1-\hat \gamma}u_T \tag{9}$$
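A quick numerical check that iterating the recursion (7), seeded with $y_{T-1} = y_T - u_T$ from the true DGP, reproduces the closed form (9); the values of $\hat\gamma$, $y_T$, $u_T$ and $k$ below are illustrative:

```python
import numpy as np

# Iterate the model-B recursion (7)/(8) for k steps and compare with the
# closed-form k-step forecast (9). All numeric values are illustrative.
gamma_hat, y_T, u_T, k = 0.3, 5.0, -0.8, 10

# Iterative forecast: y_{T+j} = y_{T+j-1} + gamma_hat * (y_{T+j-1} - y_{T+j-2}),
# seeded with y_{T-1} = y_T - u_T (true DGP) and y_T.
prev2, prev1 = y_T - u_T, y_T
for _ in range(k):
    prev2, prev1 = prev1, prev1 + gamma_hat * (prev1 - prev2)

# Closed form (9): y_T + (gamma + ... + gamma^k) * u_T via the geometric sum
closed_form = y_T + (gamma_hat - gamma_hat ** (k + 1)) / (1 - gamma_hat) * u_T
assert np.isclose(prev1, closed_form)
```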
and so
$$MSE_B[\hat y_{T+k}] =
E\left[\left(\frac {\hat \gamma - \hat \gamma ^{k+1}}{1-\hat \gamma}\right)^2u_T^2\right] + k\sigma^2_u \tag{10}$$
while I repeat for convenience
$$MSE_A[\hat y_{T+k} ] = E\big[(\hat\beta^k-1)^2 y_T^2\big]+ k\sigma^2_u \tag{5}$$
So, in order for the differenced model to perform better in terms of prediction MSE, we want
$$MSE_B[\hat y_{T+k}] \leq MSE_A[\hat y_{T+k}]$$
$$\Rightarrow E\left[\left(\frac {\hat \gamma - \hat \gamma ^{k+1}}{1-\hat \gamma}\right)^2u_T^2\right] \leq E\big[(\hat\beta^k-1)^2 y_T^2\big] $$
As with the estimator in model $B$, we extend the same courtesy to the estimator in model $A$: we reasonably expect that $\hat \beta$ will be "close to unity".
It is evident that if it so happens that $\hat \beta >1$, the quantity on the right-hand side of the inequality will tend to increase without bound as $k$, the number of forecast-ahead steps, increases. On the other hand, the quantity on the left-hand side of the desired inequality may increase as $k$ increases, but it has an upper bound. So in this scenario, we expect the differenced model $B$ to fare better in terms of prediction MSE than model $A$.
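The two factors can be compared numerically; a small sketch with illustrative values $\hat\beta = 1.02$ and $\hat\gamma = 0.2$ (both assumed, not estimated):

```python
import numpy as np

# Illustrate the boundedness argument: with beta_hat slightly above 1, the
# model-A factor (beta_hat^k - 1)^2 grows without bound in k, while the
# model-B factor ((gamma_hat - gamma_hat^(k+1)) / (1 - gamma_hat))^2 is
# bounded above by (gamma_hat / (1 - gamma_hat))^2.
beta_hat, gamma_hat = 1.02, 0.2  # illustrative values near the truth
ks = np.arange(1, 201)

factor_A = (beta_hat ** ks - 1) ** 2
factor_B = ((gamma_hat - gamma_hat ** (ks + 1)) / (1 - gamma_hat)) ** 2

assert factor_A[-1] > 1000 * factor_A[0]                              # unbounded growth
assert factor_B.max() <= (gamma_hat / (1 - gamma_hat)) ** 2 + 1e-12   # bounded
```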
But assume the case more advantageous to model $A$, where $\hat \beta <1$. Then the right-hand-side quantity is also bounded. Then as $k \rightarrow \infty$ we have to examine whether
$$E\left[\left(\frac {\hat \gamma}{1-\hat \gamma}\right)^2u_T^2\right] \leq E\big[y_T^2\big]= T\sigma^2_u\;\; ??$$
(the $k \rightarrow \infty$ is a convenience; in reality both magnitudes will be close to their suprema already for small values of $k$).
Note that the term $ \left(\frac {\hat \gamma }{1-\hat \gamma}\right)^2$ is expected to be "rather close" to $0$, so model $B$ has an advantage from this aspect.
We cannot separate the remaining expected value, because the estimator $\hat \gamma$ is not independent from $u_T$. But we can transform the inequality into
$$\operatorname{Cov}\left[\left(\frac {\hat \gamma}{1-\hat \gamma}\right)^2,\,u_T^2\right] + E\left[\left(\frac {\hat \gamma}{1-\hat \gamma}\right)^2\right]\cdot \sigma^2_u \leq T\sigma^2_u\;\; ??$$
$$\Rightarrow \operatorname{Cov}\left[\left(\frac {\hat \gamma}{1-\hat \gamma}\right)^2,\,u_T^2\right] \leq \left (T-E\left[\left(\frac {\hat \gamma}{1-\hat \gamma}\right)^2\right]\right)\cdot \sigma^2_u \;\; ??$$
Now, the covariance on the left-hand side is expected to be small, since the estimator $\hat \gamma$ depends on all $T$ errors. On the other side of the inequality, $\hat \gamma$ comes from a stationary data set, and so the expected value of the above function of it is expected to be much less than the size of the sample (since, moreover, this function will range in $(0,1)$).
So, all in all, without discussing any specific estimation method, I believe we were able to show informally that the differenced model should be expected to perform better in terms of prediction MSE.
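As a sanity check on this informal conclusion, here is a small Monte Carlo sketch; it assumes OLS estimation in both models (the argument above deliberately left the estimator open) and Gaussian noise:

```python
import numpy as np

# Monte Carlo check: simulate the random-walk DGP (1), estimate beta by OLS
# of y_t on y_{t-1} (model A) and gamma by OLS of dy_t on dy_{t-1} (model B),
# forecast k steps ahead via (4) and (9), and compare squared forecast errors.
# OLS and Gaussian noise are assumptions of this sketch, not of the argument.
rng = np.random.default_rng(1)
T, k, reps, sigma_u = 100, 10, 2000, 1.0
se_A, se_B = [], []

for _ in range(reps):
    u = rng.normal(0.0, sigma_u, size=T + k)
    y = np.concatenate([[0.0], np.cumsum(u)])   # y_0, ..., y_{T+k}
    yin = y[: T + 1]                            # observed sample y_0..y_T

    beta = (yin[1:] @ yin[:-1]) / (yin[:-1] @ yin[:-1])   # model A, no intercept
    dy = np.diff(yin)
    gamma = (dy[1:] @ dy[:-1]) / (dy[:-1] @ dy[:-1])      # model B, no intercept

    fA = beta ** k * yin[-1]                                          # (4)
    fB = yin[-1] + (gamma - gamma ** (k + 1)) / (1 - gamma) * dy[-1]  # (9)
    se_A.append((fA - y[T + k]) ** 2)
    se_B.append((fB - y[T + k]) ** 2)

print(np.mean(se_A), np.mean(se_B))  # MSE_B should come out smaller
```

The downward small-sample bias of $\hat\beta$ under the unit root (which grows geometrically through $\hat\beta^k$) is what drives the gap.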
I suspect there is no general term that will cover all cases. Consider, for example, a white noise generator. In that case, we would just call it white noise. Now if the white noise comes from a natural source, e.g., AM radio band white noise, then it has effects including superimposed diurnal, seasonal, and sunspot (11-year) solar variability, and man-made primary and beat interference from radio broadcasts.
For example, the graph in the link mentioned by the OP looks like amplitude-modulated white noise, almost like an earthquake. I personally would examine such a curve in the frequency and/or phase domain, and describe it as an evolution of such in time, because directly observing how the amplitudes over a set of frequency ranges evolve in time with respect to detection limits reveals a lot more about the signal structure than thinking about stationarity, mainly by reason of conceptual compactness. I understand the appeal of statistical testing. However, it would take umpteen tests and oodles of different criteria, as in the link, to incompletely describe an evolving frequency-domain concept, which makes the attempt to develop the concept of stationarity as a fundamental property seem rather confining. How does one go from that to Bode plotting, and phase plotting?
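To sketch the frequency-domain view described above (all parameters illustrative, and using plain FFTs rather than any particular spectrogram routine): amplitude-modulated white noise shows up as a broadband spectral level that drifts with the envelope over time.

```python
import numpy as np

# Amplitude-modulated white noise: nonstationary in the time domain, but a
# short-time spectrum shows the modulation as a slowly varying broadband level.
rng = np.random.default_rng(2)
n, win = 4096, 256
t = np.arange(n)
envelope = 1.0 + 0.8 * np.sin(2 * np.pi * t / n)   # slow amplitude modulation
x = envelope * rng.normal(size=n)

# Power spectrum per non-overlapping window: a crude spectrogram, no plotting
frames = x[: n - n % win].reshape(-1, win)
power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
level = power.mean(axis=1)                          # broadband level per frame

# Frames where the envelope peaks carry far more power than where it dips
print(level.max() / level.min())
```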
Having said that much, signal processing becomes more complicated when a "primary" violation of stationarity occurs; the patient dies, the signal stops, the random walk continues, and so forth. Such processes are easier to describe as a non-stationarity than variously as an infinite sum of odd harmonics, or as a frequency decreasing to zero. The OP's complaint about not having much literature documenting second-order stationarity is entirely reasonable; there does not seem to be complete agreement as to what even constitutes ordinary stationarity. For example, NIST claims that "A stationary process has the property that the mean, variance and autocorrelation structure do not change over time." Others on this site claim that "Autocorrelation doesn't cause non-stationarity," or, using mixture distributions of RVs, that "This process is clearly not stationary, but the autocorrelation is zero for all lags since the variables are independent." This is problematic because auto-non-correlation is typically "tacked on" as an additional criterion of non-stationarity without much consideration of how necessary and sufficient it is for defining a process. My advice would be first to observe a process, then to describe it, using phrases couched in modifiers such as "stationary/non-stationary with respect to," as the alternative is to confuse many readers as to what is meant.
Best Answer
For time series, you want to use an LSTM or another recurrent neural network instead of the unsupervised generative models of deep learning. You can also convert the problem into a supervised one that uses $x_t$ to predict $x_{t+1}$, with whichever machine learning algorithm you like.
As you can see in the blog post and in the Wikipedia article on generative models: "In probability and statistics, a generative model is a model for randomly generating observable data values, typically given some hidden parameters." Note that the input to a generative model is some hidden parameters, which are usually meaningless, while you want the input to be $x_t$ when generating $x_{t+1}$.
You should also note in the blog post of OpenAI that "Autoregressive models such as PixelRNN instead train a network that models the conditional distribution of every individual pixel given previous pixels (to the left and to the top). This is similar to plugging the pixels of the image into a char-rnn, but the RNNs run both horizontally and vertically over the image instead of just a 1D sequence of characters." So PixelRNN is basically a recurrent neural network, except that the sequence runs in two directions over the image, while you only have time as a single 1D direction. This still just leads you to LSTM.
If you really, really, really want to use fancy algorithms such as GANs, here is what you can do: forget you have a time series. Usually, time series data are generated one step after another according to $p(x_{t+1}|x_{1..t})$, which is like PixelRNN. Instead, just think about the data as a bunch of numbers, or as a 1D image in analogy to image generation, and now you have a perfect analogy to image generation using a GAN: each whole time series is a single training example for your GAN. This method totally ignores some characteristics of time series, for example causality, and just regards your data as a bunch of numbers. I highly doubt this will work well, but I encourage you to try.
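The supervised reframing from the first paragraph can be sketched in a few lines; `to_supervised` is a hypothetical helper name, not a library function:

```python
import numpy as np

# Turn a series x into (input, target) pairs where lagged values predict
# x_{t+1}. Any regressor can then be trained on (X, y); the windowing
# itself is the whole trick.
def to_supervised(x, lags=1):
    """Stack `lags` lagged values as features for one-step-ahead prediction."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([x[i : len(x) - lags + i] for i in range(lags)])
    y = x[lags:]
    return X, y

x = np.arange(10.0)          # toy series 0..9
X, y = to_supervised(x, lags=3)
print(X[0], y[0])            # → [0. 1. 2.] 3.0
```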