There are two common ways of utilizing the maximum-margin hyperplane of a trained SVM.
(1) Prediction for new data points
Based on a given training dataset, an SVM hyperplane is fully specified by its normal vector $w$ and an intercept $b$. (These variable names derive from a tradition established in the neural-networks literature, where the two respective quantities are referred to as 'weight' and 'bias.') As noted before, a new data point $x \in \mathbb{R}^d$ can then be classified as
\begin{align}
f(x) = \textrm{sgn}\left(\langle w,x \rangle + b \right)
\end{align}
where $\langle w,x \rangle$ represents the inner product. Thanks to the Karush-Kuhn-Tucker complementarity conditions, the discriminant function can be rewritten as
\begin{align}
f(x) = \textrm{sgn}\left(\sum_{i \in SV} y_i \alpha_i \langle x_i, x \rangle + b\right),
\end{align}
where the hyperplane is implicitly encoded by the support vectors $x_i$, their class labels $y_i$, and the support vector coefficients $\alpha_i$. The support vectors are those training data points which are closest to the separating hyperplane. Thus, predictions can be made very efficiently, since only inner products (alternatively: kernel functions) between the support vectors and the test point have to be evaluated.
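To make this concrete, here is a minimal sketch assuming scikit-learn's `SVC` on hypothetical toy data; scikit-learn stores the products $y_i \alpha_i$ in its `dual_coef_` attribute, so the dual-form sum becomes a matrix product:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two Gaussian blobs in 2-D (all numbers made up).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.repeat([-1, 1], 20)

clf = SVC(kernel="linear").fit(X, y)

# Dual-form decision value: f(x) = sum_{i in SV} y_i * alpha_i * <x_i, x> + b.
x_new = np.array([0.5, -0.3])
f = clf.dual_coef_ @ clf.support_vectors_ @ x_new + clf.intercept_
print(np.sign(f))                      # predicted class label
print(clf.decision_function([x_new]))  # same value, via the library

# Signed distance to the hyperplane (a number, not a probability):
print(f / np.linalg.norm(clf.coef_))
```

The last line computes the hyperplane distance discussed next; note that it is just a number, with no probability attached.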
Some have suggested also considering the distance between a new data point $x$ and the hyperplane as an indicator of how confident the model is in its prediction. However, it is important to note that the hyperplane distance itself does not afford inference: there is no probability associated with a new prediction, which is why an SVM is sometimes referred to as a point classifier. If probabilistic output is desired, other classifiers may be more appropriate, e.g., the SVM's probabilistic cousin, the relevance vector machine (RVM).
(2) Reconstructing feature weights
There is another way of putting an SVM model to use. In many classification analyses it is interesting to examine which features drove the classifier, i.e., which features played the biggest role in shaping the separating hyperplane. Given a trained SVM model with a linear kernel, these feature coefficients $w_1, \ldots, w_d$ can be reconstructed easily using
\begin{align}
w = \sum_{i=1}^n y_i \alpha_i x_i
\end{align}
where $x_i$ and $y_i$ represent the $i^\textrm{th}$ training example and its corresponding class label. (The sum effectively runs over the support vectors only, since $\alpha_i = 0$ for all other training points; this reconciles the formula with the sum over $SV$ above.)
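A sketch of this reconstruction, again assuming scikit-learn on made-up toy data; since `dual_coef_` already holds $y_i \alpha_i$, the formula reduces to one matrix product:

```python
import numpy as np
from sklearn.svm import SVC

# Same hypothetical toy problem as in the sketch above.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.repeat([-1, 1], 20)
clf = SVC(kernel="linear").fit(X, y)

# w = sum_i y_i * alpha_i * x_i, summed over the support vectors.
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()

# For a linear kernel, scikit-learn exposes the same vector as coef_:
np.testing.assert_allclose(w, clf.coef_.ravel())
print(w)  # the feature weights w_1, ..., w_d
```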
An important caveat of this approach is that the resulting feature weights are simple numerical coefficients without inferential quality; there is no measure of confidence associated with them. Thus, we cannot readily argue that some features were 'more important' than others, and we cannot infer that a feature with a particularly low coefficient was 'not important' in the classification problem. In order to allow for inference on feature weights, we would need to resort to more general-purpose approaches, such as the bootstrap, a permutation test, or a feature-selection algorithm embedded in a cross-validation scheme.
In short, I think they operate in different learning paradigms.
The state-space (hidden-state) models and the stateless models you mentioned both try to discover the underlying relationships in your time series, but within different learning paradigms: (1) maximum-likelihood estimation, (2) Bayesian inference, or (3) empirical risk minimization.
In a state-space model:
Let $x_t$ be the hidden state and $y_t$ the observables, for $t > 0$ (assume there is no control input).
You assume the following relationships for the model:
$P(x_0)$ as a prior over the initial state,
$P(x_t | x_{t-1})$ for $t \geq 1$ as how the state evolves (in an HMM, it is a transition matrix),
$P(y_t | x_t)$ for $t \geq 1$ as how the state is observed (in an HMM, it could be normal distributions conditioned on $x_t$),
and $y_t$ depends only on $x_t$.
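To make these assumptions concrete, a minimal generative sketch of a hypothetical two-state Gaussian HMM (all parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

startprob = np.array([0.6, 0.4])     # P(x_0)
transmat = np.array([[0.9, 0.1],
                     [0.2, 0.8]])    # P(x_t | x_{t-1})
means = np.array([0.0, 5.0])         # P(y_t | x_t) is N(means[x_t], sigmas[x_t])
sigmas = np.array([1.0, 2.0])

T = 100
x = np.empty(T, dtype=int)
y = np.empty(T)
x[0] = rng.choice(2, p=startprob)    # draw the initial state from the prior
y[0] = rng.normal(means[x[0]], sigmas[x[0]])
for t in range(1, T):
    x[t] = rng.choice(2, p=transmat[x[t - 1]])    # state evolves via the transition matrix
    y[t] = rng.normal(means[x[t]], sigmas[x[t]])  # observation depends only on x_t
```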
When you use Baum-Welch to estimate the parameters, you are in fact looking for a maximum-likelihood estimate of the HMM.
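For instance, assuming the hmmlearn package (whose `GaussianHMM.fit` runs Baum-Welch/EM under the hood), a sketch on made-up data:

```python
import numpy as np
from hmmlearn import hmm  # assumes the hmmlearn package is installed

# Made-up 1-D observations drawn from two regimes.
rng = np.random.default_rng(1)
obs = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(5.0, 2.0, 50)])

# Baum-Welch (EM) = maximum-likelihood estimation of the HMM parameters.
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
model.fit(obs.reshape(-1, 1))  # hmmlearn expects shape (n_samples, n_features)

print(model.startprob_)  # estimated P(x_0)
print(model.transmat_)   # estimated transition matrix
print(model.means_)      # estimated per-state means (the mu_j)
```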
If you use a Kalman filter, you are solving a special case of the Bayesian filtering problem (whose update step is in fact an application of Bayes' theorem):
Prediction step:
$\displaystyle P(x_t|y_{1:t-1}) = \int P(x_t|x_{t-1})P(x_{t-1}|y_{1:t-1}) \, dx_{t-1}$
Update step:
$\displaystyle P(x_t|y_{1:t}) = \frac{P(y_t|x_t)P(x_t|y_{1:t-1})}{\int P(y_t|x_t)P(x_t|y_{1:t-1}) \, dx_t}$
In the Kalman filter, we assume the noise statistics are Gaussian and the relationships $P(x_t|x_{t-1})$ and $P(y_t|x_t)$ are linear. All the distributions above therefore stay Gaussian, so $P(x_t|y_{1:t-1})$ and $P(x_t|y_{1:t})$ can each be represented by a mean and a covariance of $x_t$ (which are sufficient statistics for a normal distribution), and the algorithm reduces to matrix formulas.
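A bare-bones sketch of one predict/update cycle under these linear-Gaussian assumptions (variable and function names are mine, not from any particular library):

```python
import numpy as np

def kalman_step(m, P, y, A, C, Q, R):
    """One cycle for x_t = A x_{t-1} + noise(Q), y_t = C x_t + noise(R),
    where (m, P) are the mean and covariance of the current state estimate."""
    # Prediction step: push the Gaussian through the linear dynamics.
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Update step: Bayes' theorem, which stays Gaussian here.
    S = C @ P_pred @ C.T + R             # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)  # Kalman gain
    m_new = m_pred + K @ (y - C @ m_pred)
    P_new = (np.eye(len(m_pred)) - K @ C) @ P_pred
    return m_new, P_new

# Example: a scalar random walk observed with unit noise (made-up numbers).
m, P = np.zeros(1), np.eye(1)
A = C = Q = R = np.eye(1)
for y_t in [0.9, 1.2, 1.1]:
    m, P = kalman_step(m, P, np.array([y_t]), A, C, Q, R)
print(m, P)
```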
On the other hand, the stateless models you mentioned, like SVMs, splines, regression trees, and nearest neighbors, try to discover the underlying relationship between $\{y_0, y_1, \ldots, y_{t-1}\}$ and $y_t$ by empirical risk minimization.
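A sketch of this windowing idea, assuming scikit-learn's `SVR` on a made-up series (window length p = 3 is an arbitrary choice):

```python
import numpy as np
from sklearn.svm import SVR

# Made-up toy series: a noisy sine wave.
rng = np.random.default_rng(2)
series = np.sin(np.arange(200) * 0.1) + rng.normal(0.0, 0.1, 200)

# Recast the series as a supervised ERM problem: predict y_t from the
# previous p values.
p = 3
X = np.array([series[t - p:t] for t in range(p, len(series))])  # windows
y = series[p:]                                                  # targets

model = SVR(kernel="rbf").fit(X, y)
y_next = model.predict(series[-p:].reshape(1, -1))  # one-step-ahead forecast
print(y_next)
```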
For maximum-likelihood estimation, you need to parametrize the underlying probability distribution first (as in the HMM: you have the transition matrix, and the observation model is given by $(\mu_j, \sigma_j)$ for each state $j$).
For an application of Bayes' theorem, you first need a "correct" prior $P(A)$, in the sense that $P(A) \neq 0$. If $P(A) = 0$, then any inference yields $0$, since $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$.
For empirical risk minimization, universal consistency is guaranteed for any underlying probability distribution, provided the VC dimension of the learning rule does not grow too fast as the number of available data points $n \to \infty$.
SVMs are particularly ill-suited to time-series prediction because of their stationarity assumptions. You can circumvent these restrictions through clever feature or kernel engineering, essentially rendering the data distribution stationary. I'd suggest taking a look at the ARIMA-related literature and models to see some of the typical "terms" that are included.
Specifically, you could try incorporating:
The latter should easily solve your problem afaik.
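As a minimal illustration of that kind of feature engineering, a sketch assuming two typical ARIMA-style terms, lagged values and first differences (the toy series and all settings are made up):

```python
import numpy as np
from sklearn.svm import SVR

# Made-up nonstationary series: a random walk with drift.
rng = np.random.default_rng(3)
trend = np.cumsum(rng.normal(0.1, 1.0, 300))

diffed = np.diff(trend)  # first differences remove the trend (the "I" part)
p = 4                    # number of lag features (the "AR" part; arbitrary)
X = np.array([diffed[t - p:t] for t in range(p, len(diffed))])
y = diffed[p:]

model = SVR(kernel="rbf").fit(X, y)
next_diff = model.predict(diffed[-p:].reshape(1, -1))
next_value = trend[-1] + next_diff[0]  # undo the differencing for the forecast
print(next_value)
```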