There are three main solutions that come to mind, two of which are highly related:
Non-stationary Distributions: Online learning
This is a great case for time-series learning with a non-stationary distribution. This is the cutting edge and there are only a handful of people even studying this. Here is a link to the most recent research on the topic:
http://www.cs.nyu.edu/~mohri/pub/nonstat.pdf
The main takeaway is that if your test distribution is very different from your training distribution, performance will be bad; how bad depends on how quickly your distribution changes. SVMs make distributional assumptions that are inappropriate for this task: namely, they assume the data are drawn i.i.d. from a single stationary distribution. A better fit is an online learner, which makes no such distributional assumptions. Here are some slides that go over online learning in detail: http://www.cims.nyu.edu/~mohri/amls/lecture_4.pdf
Online learning uses the notion of regret over "experts" to make predictions. The expert weights change over time as the algorithm makes mistakes. In this case, you could have an "expert" for each feature and for each combination of features. You then feed the algorithm data points one at a time and update the weights of each expert in hindsight. This lets you track changes in your distribution very easily.
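A minimal sketch of one such scheme, the exponentially-weighted-experts (Hedge) update, assuming binary predictions and toy experts (the names and the value of `eta` are illustrative, not taken from the paper):

```python
import numpy as np

def hedge_predict(expert_preds, weights):
    """Weighted-majority vote over the experts' binary predictions."""
    return 1 if np.dot(weights, expert_preds) >= 0.5 * weights.sum() else 0

def hedge_update(weights, expert_preds, truth, eta=0.5):
    """Exponentially down-weight each expert that was wrong in hindsight."""
    losses = (np.asarray(expert_preds) != truth).astype(float)
    return weights * np.exp(-eta * losses)

# toy stream: expert 0 is always right, expert 1 is always wrong
weights = np.ones(2)
for truth in [1, 0, 1, 1, 0]:
    preds = [truth, 1 - truth]          # each expert's vote this round
    yhat = hedge_predict(preds, weights)
    weights = hedge_update(weights, preds, truth)

print(weights)  # expert 0 keeps weight 1.0; expert 1's weight decays
```

Because the weights decay exponentially with accumulated loss, an expert that starts performing well after a distribution shift can regain influence quickly.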
Domain Adaptation
Depending on your features, this might be a good place to use domain adaptation. Loosely, domain adaptation is used when the distribution of features in your training set differs from that in your test set. This requires you to use the unlabeled test data to put weights on your training data. In other words, if $f_1$ occurs often in your training set but only very sparsely in your test set, then you might want to lower the weight of $f_1$. This might be the case if some of the SVM features are outperforming the model. For an extensive review: http://www.cs.nyu.edu/~mohri/pub/nsmooth.pdf (this review goes over the case of regression, but can be easily adapted to multi-class classification).
You will have to look at your data, but do you find that the features themselves change over time (or the distribution of those features, rather)? If so, you can use something like importance weighting, where the training features are weighted based on their likelihood of occurring in the test set. For further discussion on the topic: http://www.cs.nyu.edu/~rostami/papers/nadap.pdf and http://papers.nips.cc/paper/4156-learning-bounds-for-importance-weighting.pdf
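As a rough illustration of importance weighting, here is a sketch that estimates $P_{test}(x)/P_{train}(x)$ for a single discrete feature by smoothed counting (the function name and smoothing scheme are my own, not from the linked papers):

```python
import numpy as np
from collections import Counter

def importance_weights(train_feats, test_feats, smoothing=1.0):
    """Weight each training point by P_test(x)/P_train(x), estimated
    from discrete feature counts with additive smoothing."""
    train_counts = Counter(train_feats)
    test_counts = Counter(test_feats)
    vocab = set(train_counts) | set(test_counts)
    n_tr = len(train_feats) + smoothing * len(vocab)
    n_te = len(test_feats) + smoothing * len(vocab)
    p_tr = {v: (train_counts[v] + smoothing) / n_tr for v in vocab}
    p_te = {v: (test_counts[v] + smoothing) / n_te for v in vocab}
    return np.array([p_te[x] / p_tr[x] for x in train_feats])

train = ["a", "a", "a", "b"]   # feature value of each training point
test  = ["b", "b", "b", "a"]   # "b" is much more common at test time
w = importance_weights(train, test)
print(w)  # training points with feature "b" get up-weighted
```

The papers above give bounds and more robust (e.g. kernel-based) density-ratio estimators; this count-based version is only meant to show the shape of the idea.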
Note: These are fairly advanced papers at the cutting edge of machine learning research, so you will likely have to implement the algorithms yourself. Luckily, pseudocode is available in the papers.
Reinforcement Learning
The way you posed the problem made me immediately think of reinforcement learning. In this case, you want to learn the likelihood of an action based on the features. This could be constructed using a Markov Decision Process. This is valuable because it can quickly learn despite distributional changes. In fact, the learning rate $\alpha$ can determine the amount of "memory" in your learner.
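As a toy illustration of how $\alpha$ acts as a memory knob, here is a minimal tabular Q-learning sketch on a hypothetical two-state MDP (the environment and all names are invented for illustration):

```python
import random

# A hypothetical two-state MDP: action 1 pays off in state "A",
# action 0 pays off in state "B"; the states simply alternate.
def step(state, action):
    reward = 1.0 if (state == "A") == (action == 1) else 0.0
    next_state = "B" if state == "A" else "A"
    return reward, next_state

def q_learning(steps=2000, alpha=0.3, gamma=0.9, eps=0.1):
    q = {s: {a: 0.0 for a in (0, 1)} for s in ("A", "B")}
    state = "A"
    for _ in range(steps):
        # epsilon-greedy action choice
        if random.random() < eps:
            action = random.choice([0, 1])
        else:
            action = max(q[state], key=q[state].get)
        reward, nxt = step(state, action)
        # alpha sets how fast old value estimates are overwritten: a
        # larger alpha means a shorter "memory", which helps when the
        # reward distribution changes over time
        q[state][action] += alpha * (
            reward + gamma * max(q[nxt].values()) - q[state][action])
        state = nxt
    return q

random.seed(0)
q = q_learning()
```

After training, the learned values prefer the paying action in each state; if the payoff structure later flipped, the same updates would track the new optimum at a speed set by $\alpha$.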
For a comprehensive discussion and extensive literature review, look at these slides: http://www.cs.nyu.edu/~mohri/mls/lecture_11.pdf
I hope this helps, let me know if any of these sound like they might apply to your case and I can elaborate.
What you are proposing is known as a "rolling origin" evaluation in the forecasting literature. And yes, this method of evaluating forecasting algorithms is very widely used.
If you find that performance is a bottleneck, you could do subsampling. Don't use every possible origin. Instead, use, e.g., every fifth possible origin. (Make sure you don't introduce unwanted confounding between your subsampled origins and seasonality in the data. For instance, if you use daily data, don't use every seventh day as an origin, because then you are really only assessing forecasting quality on Tuesdays, or only on Thursdays etc.)
Moreover, you don't really need to retrain your model from scratch every time you roll the origin forward. Start out from the last trained model. (For example, in Exponential Smoothing, simply update your components with the new data since the last training.) This should dramatically cut down on your overall training time.
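A sketch of both ideas together, subsampled origins plus a warm-started simple-exponential-smoothing level, with one-step-ahead MAE as the score (the function and parameter names are illustrative):

```python
def rolling_origin_mae(series, initial_train=24, stride=5, alpha=0.3):
    """Rolling-origin one-step-ahead evaluation with simple exponential
    smoothing. The level is warm-started and updated incrementally
    instead of refitting from scratch at every origin; `stride`
    subsamples the origins (evaluate at every stride-th origin only).
    Pick a stride that is not a multiple of the seasonal period."""
    # fit on the initial window: one pass of SES
    level = series[0]
    for y in series[1:initial_train]:
        level = alpha * y + (1 - alpha) * level

    errors = []
    for t in range(initial_train, len(series)):
        if (t - initial_train) % stride == 0:
            errors.append(abs(series[t] - level))  # one-step forecast = level
        level = alpha * series[t] + (1 - alpha) * level  # warm-start update
    return sum(errors) / len(errors)

series = [10 + (i % 3) for i in range(60)]  # toy series with period 3
mae = rolling_origin_mae(series)
print(mae)
```

Note that the toy series has period 3 and the stride is 5, so the subsampled origins do not all land on the same phase of the seasonal cycle, which is exactly the confounding warned about above.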
Your problem seems to be well suited for online learning.
You can use stochastic or mini-batch gradient descent to train a neural network continuously over time.
In stochastic gradient descent, you take one gradient step each time you get a new training example; in mini-batch gradient descent, you take a step each time you gather a batch of $n$ training examples. Since your local minima change over time, your neural net will continuously adapt its weights as newer data comes in, fitting the new local minimum.
You can also play around with the step size. The larger it is, the more heavily recent data is effectively weighted (i.e., the easier it is to escape the older local minimum), but too large a step can make training very unstable. It's worth tuning as a hyperparameter and validating with backtests over time.
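A minimal sketch of that setup, using mini-batch SGD on a linear model (standing in for the network) whose true parameters drift halfway through the stream; all names and constants here are illustrative:

```python
import numpy as np

def sgd_step(w, X, y, lr):
    """One mini-batch gradient step for squared loss on a linear model."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

rng = np.random.default_rng(0)
w = np.zeros(2)
true_w = np.array([1.0, -1.0])
for t in range(500):
    if t == 250:
        true_w = np.array([-2.0, 3.0])    # the distribution drifts
    X = rng.normal(size=(8, 2))           # mini-batch of n = 8 examples
    y = X @ true_w + 0.01 * rng.normal(size=8)
    w = sgd_step(w, X, y, lr=0.05)

print(np.round(w, 1))  # close to the new optimum [-2, 3] after the drift
```

With `lr=0.05` the old minimum is forgotten within a few dozen batches; shrinking the step size would slow that adaptation but reduce the variance of the final weights.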
You can watch this video by Andrew Ng for an example of online learning: https://www.youtube.com/watch?v=dnCzy_XKGbA