In short, I think they work in different learning paradigms.
A state-space model (hidden-state model) and the other stateless models you mentioned discover the underlying relationship of your time series under different learning paradigms: (1) maximum-likelihood estimation, (2) Bayesian inference, (3) empirical risk minimization.
In a state-space model,
let $x_t$ be the hidden state and $y_t$ the observable, for $t>0$ (assume there is no control input).
You assume the following relationships for the model:
$P(x_0)$ as a prior,
$P(x_t | x_{t-1})$ for $t \geq 1$ as how your state changes (in an HMM, it is a transition matrix),
$P(y_t | x_t)$ for $t \geq 1$ as how you observe (in an HMM, it could be normal distributions conditioned on $x_t$),
and $y_t$ depends only on $x_t$.
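To make the generative assumptions concrete, here is a minimal NumPy sketch that samples from a 2-state HMM with Gaussian emissions. The transition matrix, prior, and emission parameters are made-up illustrative numbers, not anything from your data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state HMM (illustrative numbers).
prior = np.array([0.6, 0.4])              # P(x_0)
A = np.array([[0.9, 0.1],                 # P(x_t | x_{t-1}): transition matrix
              [0.2, 0.8]])
mu = np.array([0.0, 3.0])                 # Gaussian emission mean per state
sigma = np.array([1.0, 0.5])              # Gaussian emission std dev per state

def sample_hmm(T):
    """Draw (x_1..x_T, y_1..y_T) from the generative model above."""
    states, obs = [], []
    x = rng.choice(2, p=prior)            # x_0 ~ P(x_0)
    for _ in range(T):
        x = rng.choice(2, p=A[x])         # x_t ~ P(x_t | x_{t-1})
        y = rng.normal(mu[x], sigma[x])   # y_t ~ P(y_t | x_t): depends only on x_t
        states.append(x)
        obs.append(y)
    return np.array(states), np.array(obs)

states, obs = sample_hmm(100)
```

Note that each $y_t$ is drawn using only the current $x_t$, which is exactly the conditional-independence assumption stated above.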
When you use Baum-Welch to estimate the parameters, you are in fact computing a maximum-likelihood estimate of the HMM parameters.
If you use a Kalman filter, you are solving a special case of the Bayesian filtering problem (which is in fact an application of Bayes' theorem in the update step):
Prediction step:
$\displaystyle P(x_t|y_{1:t-1}) = \int P(x_t|x_{t-1})P(x_{t-1}|y_{1:t-1}) \, dx_{t-1}$
Update step:
$\displaystyle P(x_t|y_{1:t}) = \frac{P(y_t|x_t)P(x_t|y_{1:t-1})}{\int P(y_t|x_t)P(x_t|y_{1:t-1}) \, dx_t}$
In the Kalman filter, we assume the noise statistics are Gaussian and that the relationships $P(x_t|x_{t-1})$ and $P(y_t|x_t)$ are linear. You can therefore represent $P(x_t|y_{1:t-1})$ and $P(x_t|y_{1:t})$ simply by the mean and covariance of $x_t$ (mean + covariance is sufficient for a normal distribution), and the algorithm reduces to matrix formulas.
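Those matrix formulas can be sketched directly from the prediction and update equations above. This is a minimal NumPy implementation with made-up dynamics and noise covariances for a scalar random walk, not tuned for any real system:

```python
import numpy as np

def kalman_step(m, P, y, F, Q, H, R):
    """One prediction + update step of the Kalman filter.
    m, P : mean/covariance of P(x_{t-1} | y_{1:t-1})
    F, Q : linear dynamics matrix and process-noise covariance
    H, R : linear observation matrix and observation-noise covariance
    """
    # Prediction step: P(x_t | y_{1:t-1}) is Gaussian with mean m_pred, cov P_pred.
    m_pred = F @ m
    P_pred = F @ P @ F.T + Q
    # Update step: Bayes' theorem with the Gaussian likelihood P(y_t | x_t).
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = (np.eye(len(m)) - K @ H) @ P_pred
    return m_new, P_new

# Usage: track a scalar random walk observed with noise (illustrative numbers).
F = np.array([[1.0]]); Q = np.array([[0.01]])
H = np.array([[1.0]]); R = np.array([[0.25]])
m, P = np.array([0.0]), np.array([[1.0]])
for y in [0.9, 1.1, 1.0, 1.2]:
    m, P = kalman_step(m, P, np.array([y]), F, Q, H, R)
```

After a few observations the posterior mean moves toward the observed level and the posterior covariance shrinks below the prior, which is the "mean + covariance is sufficient" point in action.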
On the other hand, the other stateless models you mentioned, like SVMs, splines, regression trees, and nearest neighbors, try to discover the underlying relationship between $(\{y_0,y_1,...,y_{t-1}\}, y_t)$ by empirical risk minimization.
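A minimal sketch of that ERM view: turn the series into (window of past values, next value) pairs and minimize the empirical squared loss. I use ordinary least squares via NumPy as the stand-in learner, and a synthetic noisy sine as stand-in data; any of the models you listed could replace the OLS step:

```python
import numpy as np

def make_lagged(series, window):
    """Build (past window, next value) training pairs from a 1-D series."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

rng = np.random.default_rng(1)
series = np.sin(np.linspace(0, 20, 300)) + 0.1 * rng.normal(size=300)

X, y = make_lagged(series, window=5)
# Empirical risk minimization with squared loss: ordinary least squares.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = np.mean((X @ coef - y) ** 2)
```

The learner never models a hidden state; it only fits the direct map from the last 5 observations to the next one.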
For maximum-likelihood estimation, you need to parametrize the underlying probability distribution first (in an HMM, you have the transition matrix, and the observation distribution for state $j$ is parametrized by $(\mu_j,\sigma_j)$).
For the application of Bayes' theorem, you need a "correct" prior $P(A)$ first, in the sense that $P(A) \neq 0$. If $P(A)=0$, then any inference yields $0$, since $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$.
For empirical risk minimization, universal consistency is guaranteed for any underlying probability distribution, provided the VC dimension of the learning rule does not grow too fast as the number of available data points $n \to \infty$.
How you divide your data set into training/test depends on the data you have available and how your model will be used. Ideally you wouldn't randomly separate the time-points, since as you say, they are not independent if there is any temporal signal at all.
If you have multiple time-series then I'd divide the time-series themselves into training and test in whatever fashion you want.
If your training data is a single time-series and you intend to predict future values of this time-series then I'd segment it accordingly. I.e. use the first 60% of the samples as your training data and the remaining 40% as your test. Of course, these sets aren't independent but given the nature of your data this is unavoidable.
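A chronological split like that is a one-liner; the key point is that there is no shuffling, so nothing from the future leaks into training. A quick sketch (the 60/40 ratio is just the example from above):

```python
def train_test_split_ts(series, train_frac=0.6):
    """Chronological split: first train_frac of the samples for training,
    the remainder for testing. No shuffling, so no future data leaks
    into the training set."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

data = list(range(100))
train, test = train_test_split_ts(data)
```

Every training sample precedes every test sample in time, which is what makes the test set a fair proxy for predicting future values.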
If you have a single time-series for training but the actual time-series that you want to predict future values for is entirely separate then I'd still follow the procedure from the above paragraph, but bear in mind that any estimates of model fit you derive are very likely to be inaccurate.
As an aside, I would be tempted to use a Recurrent Neural Network approach to a problem like this. This would allow you to model the temporal aspect elegantly - something like an LSTM can maintain a memory of previous values without having to explicitly specify a window size. Of course, if you wish to use a window approach then you could in theory use any classification algorithm you want.
EDIT
That technical paper appears to cover the issue in far more depth than my answer, so I'd follow their recommendations instead. The main difference from your use case is that they are dealing with standard forecasting of the next sample, rather than classification.
Best Answer
You're correct: you are not getting only one model, but the performance of many. This is what happens every time we do cross-validation in usual (non-time-series) problems. In short, this is cross-validation.
The figure you posted depicts the usual way to do cross-validation on time series. The principle is that you should always evaluate your model on future data. You could also do this without overlap between the intervals, but then you'd get weaker estimates.
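The scheme in such figures (often called rolling-origin or expanding-window evaluation) can be sketched in a few lines. The fold counts and minimum training size here are arbitrary illustrative choices:

```python
def rolling_origin_splits(n, n_folds=5, min_train=20):
    """Yield (train_indices, test_indices) pairs where each fold trains on
    an expanding prefix of the series and evaluates on the block of
    samples immediately after it."""
    fold_size = (n - min_train) // n_folds
    for k in range(n_folds):
        end_train = min_train + k * fold_size
        yield (list(range(end_train)),
               list(range(end_train, end_train + fold_size)))

folds = list(rolling_origin_splits(100))
for train_idx, test_idx in folds:
    # The model is always evaluated strictly on future data.
    assert max(train_idx) < min(test_idx)
```

You fit one model per fold and aggregate the test errors, which is exactly the "performance of many models" point: the aggregate is your cross-validated estimate.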