Your understanding of sliding window analysis is generally correct. You may find it helpful to separate the model validation process from the actual forecasting. In model validation, you use $k$ instances to train a model that predicts one step forward. Make sure each of your $k$ instances uses only information available at that particular time. This can be subtle, because it is easy to accidentally peek ahead into the future and pollute your out-of-sample test.
For example, you might accidentally use the entire time series history for feature selection, and then use those features to test the model at every time step. This is cheating, and will give you an overestimate of accuracy. This pitfall is discussed in The Elements of Statistical Learning, though outside the sliding-window time-series context.
It is also easy to accidentally leak future information if some of your independent variables are asset returns. Say I use the return on an asset from $t=21$ days to $t=28$ days as a feature when testing at $t=21$ days. That return is not known until $t=28$ days, so I have again polluted the out-of-sample test. Instead I would want to train with instances up to $t=21$ days, and test with one step at $t=28$ days.
When you have validated your model and are happy with the parameters and feature selection, you typically train with all of your data and forecast into the actual future.
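As a minimal sketch of this walk-forward scheme in R (the AR(1) data here are simulated purely for illustration):

set.seed(42)
n <- 300
y <- arima.sim(list(ar = 0.7), n = n)         # toy series
dat <- data.frame(y = y[-1], y_lag = y[-n])   # the predictor uses only past information

k <- 100                                      # training window length
preds <- rep(NA_real_, nrow(dat))
for (i in (k + 1):nrow(dat)) {
  train <- dat[(i - k):(i - 1), ]             # only rows strictly before step i
  fit <- lm(y ~ y_lag, data = train)          # re-fit the model at every step
  preds[i] <- predict(fit, newdata = dat[i, ])
}
sqrt(mean((dat$y - preds)^2, na.rm = TRUE))   # honest out-of-sample RMSE

Note that any feature selection would likewise have to be re-run inside the loop on train alone; doing it once on the full series is exactly the leak described above.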
If you want to include past space-time lags in your VAR model, this is perfectly reasonable (past spatial lags are exogenous, by the same logic that past time periods of $y$ are exogenous). I would guess the most usual way to accomplish this is to include a $Wy_{t-1}$ term in the model, which is the column vector obtained when you pre-multiply the time lag $y_{t-1}$ by $W$ (your a priori specified spatial weights matrix). Although certainly an over-simplification in most realistic circumstances, you still have all the usual time-series models available if you go this route, and it wouldn't be too arduous to code up yourself.
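As a rough single-equation sketch of that route in R (the panel Y and all the numbers are hypothetical, just to show the mechanics):

n <- 3; nper <- 50                              # locations and time periods
W <- matrix(c(0,1,0, 1,0,1, 0,1,0), nrow = n)   # a priori binary spatial weights
Y <- matrix(rnorm(nper * n), nrow = nper)       # made-up panel: rows = periods, cols = locations
WYlag <- Y[-nper, ] %*% t(W)                    # row t holds the space-time lag (W y_t)'
fit <- lm(Y[-1, 1] ~ Y[-nper, 1] + WYlag[, 1])  # location 1: y_t on y_{t-1} and Wy_{t-1}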
If you want an endogenous spatial lag in the model (i.e. $Wy_{t}$), this might involve building an appropriate spatial weights matrix, and then using the usual means to estimate models with endogenous spatial lags. An "appropriate" spatial weights matrix would look like $I_t \otimes W$ where $I_t$ is an Identity matrix with the number of rows and columns equal to the number of time periods, and $\otimes$ is the Kronecker product.
Here is a brief example in R of what such a block spatial weights matrix would look like with a binary $W$:
> It <- diag(3)                # identity matrix, one row/column per time period
> W <- matrix(c(0,1,0,
+               1,0,1,
+               0,1,0), nrow = 3)   # binary first-order spatial weights
> It %x% W                     # %x% is R's Kronecker product operator
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    0    1    0    0    0    0    0    0    0
[2,]    1    0    1    0    0    0    0    0    0
[3,]    0    1    0    0    0    0    0    0    0
[4,]    0    0    0    0    1    0    0    0    0
[5,]    0    0    0    1    0    1    0    0    0
[6,]    0    0    0    0    1    0    0    0    0
[7,]    0    0    0    0    0    0    0    1    0
[8,]    0    0    0    0    0    0    1    0    1
[9,]    0    0    0    0    0    0    0    1    0
(This also shows an easy way to generate the space-time lags: if you replace $I_t$ in the Kronecker product with an identity matrix whose 1's are all shifted down one row, pre-multiplying the stacked $y$ by the result produces the $Wy_{t-1}$ column vector.)
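Continuing the toy session above (reusing W), a minimal sketch of that shifted matrix in action, with made-up stacked y values:

L <- rbind(0, cbind(diag(2), 0))      # 3x3 identity with the 1's shifted down one row
y <- c(1, 2, 3,  4, 5, 6,  7, 8, 9)   # stacked y: 3 locations per period, 3 periods
(L %x% W) %*% y                       # stacked Wy_{t-1}; the first period is all zeros

The stacking here is time-major (locations grouped within each period), matching the $I_t \otimes W$ ordering above.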
A drawback of this approach is that the Kronecker-product matrix is huge, so it might not even be feasible to estimate this model. Also, as far as I'm aware, there isn't much current code floating around for complicated space-time models, so I would say off the cuff that the time-series aspect is somewhat limited to including AR or simple trend terms, unless you want to code up your own estimators as well (I would love for people to correct me and point to working code libraries/examples if I am wrong).
I would suggest you get modelling/coding motivation from LeSage and Pace's Matlab toolbox. Their book also goes into great detail about coding up spatial models (and is language agnostic), so if you're serious about rolling your own it comes highly recommended.
Also, FYI, I would suggest you use the handy functions in the Python library pysal to implement your own STAR library if you so desire; they have already taken care of all the annoying work of generating spatial weights matrices. I also suspect they are a good group to ask about who is developing space-time models and whether any working code is already available.
It should probably be mentioned as well that in some fields (e.g. epidemiology) it is popular to fit Bayesian models and estimate the spatial terms via MCMC. I am admittedly less familiar with this, though, and so would just point to the GeoBUGS project, where one might find examples (I can scrounge up some examples from my library if requested).
Best Answer
We can refer to this paper; the explanation below summarizes its approach.
First, autoregressive models can be described as follows.
Model for time series
Given a temporal sequence of variables, $Y=(Y_{1},...,Y_{T})$, a time series is a sequence of values for these variables, $y=(y_{1},...,y_{T})$. If $f(\cdot|\cdot,\theta)$ is a probability distribution for the model, we restrict attention to models of the form
$ p(y_{t}|y_{1},...,y_{t-1},\theta) = f(y_{t}|y_{t-p},...,y_{t-1},\theta)$
The model is probabilistic, stationary, and has the $p$-Markov property: the next value depends only on the previous $p$ values.
Autoregressive Tree Model
First, an AR model is of the form
$f(y_{t}|y_{t-p},...,y_{t-1},\theta) = \mathit{N}( m + \sum_{j=1}^{p}b_{j}y_{t-j}, \sigma^{2}) $
where $\mathit{N}(\mu,\sigma^{2})$ is the normal distribution with mean $\mu$ and variance $\sigma^{2}$.
That is, at each time step, the distribution of the value has a mean that depends 'autoregressively' on the last $p$ values of the series.
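For concreteness, a quick R sketch of such a model with $p = 2$ (the coefficients are made up):

set.seed(1)
y <- arima.sim(list(ar = c(0.6, -0.3)), n = 500)   # b = (0.6, -0.3), m = 0
fit <- arima(y, order = c(2, 0, 0))                # estimates the b_j, the mean and sigma^2
coef(fit)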
An ART model is an AR model that is piecewise linear, and therefore can be represented as a tree. Each non-leaf node holds a boolean condition, and each leaf is an AR model.
The idea is simple: branching along the tree depends on past values of the series, and the leaf you reach supplies the AR model used to predict the next value.
An AR model is a degenerate ART model: the tree has a single trivial 'boolean' decision and one leaf AR model.
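To make the branching concrete, a toy sketch in R (the split rule and coefficients are invented, not taken from the paper):

# One boolean split on y_{t-1}, with a separate AR(2) model in each leaf
predict_art <- function(y_hist) {
  p <- tail(y_hist, 2)                  # (y_{t-2}, y_{t-1})
  if (p[2] <= 0) {                      # boolean condition on a past value
    1.0 + 0.8 * p[2] - 0.1 * p[1]       # leaf 1: its own m and b_j
  } else {
    0.5 + 0.3 * p[2] + 0.4 * p[1]       # leaf 2: different AR coefficients
  }
}
predict_art(c(0.2, -0.4))               # one-step-ahead point forecast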
ART models over AR models
An alternative to ART models is neural networks, but they are difficult to interpret and/or expensive to learn.