In short, I think they work in different learning paradigms.
The state-space models (hidden-state models) and the stateless models you mentioned try to discover the underlying relationships in your time series under different learning paradigms: (1) maximum-likelihood estimation, (2) Bayesian inference, (3) empirical risk minimization.
In a state-space model,
let $x_t$ be the hidden state and $y_t$ the observables, for $t > 0$ (assume there is no control input).
You assume the following relationships for the model:
$P(x_0)$ as a prior
$P(x_t | x_{t-1})$ for $t \geq 1$ as how your state changes (in an HMM, it is a transition matrix)
$P(y_t | x_t)$ for $t \geq 1$ as how you observe (in an HMM, it could be normal distributions conditioned on $x_t$)
and $y_t$ depends only on $x_t$.
When you use Baum-Welch to estimate the parameters, you are in fact looking for a maximum-likelihood estimate of the HMM's parameters.
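To make this concrete, here is a minimal sketch of Baum-Welch fitting in R using the CRAN package HMM (which handles discrete emissions; for Gaussian emissions you would need something like depmixS4). The states, symbols, and probabilities below are made up purely for illustration:

    library(HMM)

    obs <- c("a", "b", "a", "a", "b", "b", "a", "b")   # toy observation sequence

    # Initial guess for the HMM parameters (all values are illustrative)
    hmm0 <- initHMM(States  = c("S1", "S2"),
                    Symbols = c("a", "b"),
                    startProbs    = c(0.5, 0.5),
                    transProbs    = matrix(c(0.7, 0.3,
                                             0.4, 0.6), nrow = 2, byrow = TRUE),
                    emissionProbs = matrix(c(0.9, 0.1,
                                             0.2, 0.8), nrow = 2, byrow = TRUE))

    # Baum-Welch (an EM algorithm) climbs the likelihood surface
    fit <- baumWelch(hmm0, obs, maxIterations = 50)
    fit$hmm$transProbs   # maximum-likelihood estimate of the transition matrix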
If you use a Kalman filter, you are solving a special case of the Bayesian filtering problem (whose update step is in fact an application of Bayes' theorem):
Prediction step:
$\displaystyle P(x_t|y_{1:t-1}) = \int P(x_t|x_{t-1})P(x_{t-1}|y_{1:t-1}) \, dx_{t-1}$
Update step:
$\displaystyle P(x_t|y_{1:t}) = \frac{P(y_t|x_t)P(x_t|y_{1:t-1})}{\int P(y_t|x_t)P(x_t|y_{1:t-1}) \, dx_t}$
In the Kalman filter, we assume the noise statistics are Gaussian and the relationships $P(x_t|x_{t-1})$ and $P(y_t|x_t)$ are linear. Therefore you can represent $P(x_t|y_{1:t-1})$ and $P(x_t|y_{1:t})$ simply by the mean and variance of $x_t$ (sufficient statistics for a normal distribution), and the algorithm reduces to matrix formulas.
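As an illustration, here is a minimal sketch of the scalar linear-Gaussian case in R, written directly from the prediction and update equations above (all the model parameters are made up):

    A <- 1.0; H <- 1.0; Q <- 0.01; R <- 0.25   # state/observation model (made up)

    kalman_filter <- function(y, m0 = 0, P0 = 1) {
      n <- length(y)
      m <- numeric(n); P <- numeric(n)   # filtered mean and variance per step
      m_prev <- m0; P_prev <- P0
      for (t in 1:n) {
        # Prediction step: push the previous posterior through the state model
        m_pred <- A * m_prev
        P_pred <- A * P_prev * A + Q
        # Update step: Bayes' theorem, which for Gaussians reduces to the
        # Kalman gain formula
        K <- P_pred * H / (H * P_pred * H + R)
        m_prev <- m_pred + K * (y[t] - H * m_pred)
        P_prev <- (1 - K * H) * P_pred
        m[t] <- m_prev; P[t] <- P_prev
      }
      list(mean = m, var = P)
    }

    # Usage: filter noisy observations of a constant signal
    set.seed(1)
    out <- kalman_filter(rnorm(100, mean = 2, sd = 0.5))
    tail(out$mean, 1)   # close to the true level of 2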
On the other hand, the stateless models you mentioned, such as SVMs, splines, regression trees, and nearest neighbors, try to discover the underlying relationship between $(\{y_0, y_1, \ldots, y_{t-1}\}, y_t)$ by empirical risk minimization.
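For example, here is a minimal sketch in R of how a stateless learner sees a time series: flatten it into (lagged values, next value) pairs and fit any off-the-shelf regressor by minimizing the empirical risk (plain least squares here; the series and lag count are illustrative):

    set.seed(1)
    y <- as.numeric(arima.sim(list(ar = 0.8), n = 200))   # toy AR(1) series
    p <- 3                                                # number of lags

    # embed() turns the series into rows (y_t, y_{t-1}, ..., y_{t-p})
    X <- embed(y, p + 1)
    train <- data.frame(target = X[, 1],
                        lag1 = X[, 2], lag2 = X[, 3], lag3 = X[, 4])

    # Any off-the-shelf regressor could replace lm() here (tree, SVM, k-NN, ...)
    fit <- lm(target ~ ., data = train)
    coef(fit)   # the lag1 coefficient should be close to the true 0.8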
For maximum-likelihood estimation, you need to parametrize the underlying probability distribution first (as in the HMM, where you have the transition matrix and observation parameters $(\mu_j, \sigma_j)$ for each state $j$).
For an application of Bayes' theorem, you first need a "correct" prior $P(A)$, in the sense that $P(A) \neq 0$. If $P(A) = 0$, then any inference yields $0$, since $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$.
For empirical risk minimization, universal consistency is guaranteed for any underlying probability distribution as long as the VC dimension of the learning rule does not grow too fast as the number of available samples $n \to \infty$.
Note 1: It is better to split your data into three sets instead of two: a training set, a validation set, and a test set. The validation set is used in step 2 of your diagram, where you train a model on the training set and evaluate your hyperparameters on the validation set. Tuning hyperparameters on the training set is not valid, since it will likely select the hyperparameters that best fit your training data and thus cause overfitting.
Note 2: The evaluations made in steps 2 and 4 depend on the particular split of your data. Because of this, it is suggested to (i) shuffle your data before splitting, and (ii) repeat the splitting many times and report the average performance over these trials. Another solution is to use k-fold cross validation. The most accurate performance evaluation is obtained by the leave-one-out method, but leave-one-out is very slow, and in real-world applications 10-fold cross validation is usually enough.
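A minimal hand-rolled sketch of k-fold cross validation in R (the data frame and column names are made up for illustration):

    set.seed(1)
    dat <- data.frame(x = rnorm(100))
    dat$target <- 2 * dat$x + rnorm(100)   # toy regression data

    k <- 10
    folds <- sample(rep(1:k, length.out = nrow(dat)))   # shuffle, assign folds
    mse <- numeric(k)
    for (i in 1:k) {
      fit    <- lm(target ~ x, data = dat[folds != i, ])
      pred   <- predict(fit, newdata = dat[folds == i, ])
      mse[i] <- mean((dat$target[folds == i] - pred)^2)
    }
    mean(mse)   # average held-out performance over the k folds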
Note 3: You are right that if the training distribution differs from the test distribution, the learned model and the evaluated performance are not reliable for the test data. In machine learning, this issue is studied under the topic of domain adaptation.
Note 4: Since your data is an imbalanced binary classification problem, I recommend using the F-measure. Precision, recall, or accuracy alone is not enough to evaluate the performance of a model learned from imbalanced data.
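For reference, the F-measure is the harmonic mean of precision and recall; a tiny sketch in R with made-up confusion-matrix counts:

    TP <- 30; FP <- 10; FN <- 20   # made-up confusion-matrix counts
    precision <- TP / (TP + FP)
    recall    <- TP / (TP + FN)
    f1 <- 2 * precision * recall / (precision + recall)   # harmonic mean
    f1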
Finally, Andrew Ng, a well-known machine learning researcher, is working on a book called "Machine Learning Yearning". The book focuses on the practical aspects of machine learning, and each chapter is around two pages. As far as I know, 12 chapters of the book have been released. They are related to your questions, and I recommend reading them.
Best Answer
Small datasets and few features are a domain where traditional statistical models tend to do very well, because they offer the ability to actually interpret the importance of your features.
I'm assuming by "simple regression" you mean predicting a real-valued, continuous variable y from your input variables. You mention that you suspect you may have non-linear relationships, and you don't know much about the importance of the features.
My instincts in this case would be to use a generalized additive model (GAM), like the mgcv package for R. mgcv has very nice default methods for choosing some of the more arcane parameters in a GAM, like how many knots and where to put them.
Maybe you have three predictors, x1, x2, and x3, where x1 and x2 are continuous and x3 is a categorical variable. In this case you could do (in R):
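    library(mgcv)

    # A sketch of the call (the data frame `dat` is assumed): s() places a
    # smooth spline on each continuous predictor, while the categorical x3
    # enters as an ordinary linear term.
    model <- gam(y ~ s(x1) + s(x2) + x3, data = dat, method = "REML")
    summary(model)   # approximate significance of each term
    plot(model)      # visualize the fitted smooths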
That last part about using REML is a personal preference. It controls how "wiggly" the nonlinear curves are allowed to be. The default method uses, if I recall correctly, generalized cross-validation, which works fine, though in my experience it tends to give "wigglier" curves.