The open-source project Vowpal Wabbit includes an implementation of online SGD that is enhanced by on-the-fly (online) computation of three additional factors affecting the weight updates. Each factor can be enabled or disabled by its own command-line option. By default all three are turned on; the --sgd option turns them all off, i.e. falls back on "classic" SGD.
The three SGD-enhancing options are:
- --normalized: updates adjusted for the scale of each feature
- --adaptive: uses adaptive gradients (AdaGrad; Duchi, Hazan & Singer)
- --invariant: importance-aware updates (Karampatziakis & Langford)
Together, they ensure that the online learning process automatically compensates/adjusts in three ways:
- per-feature scaling (large vs. small feature values)
- per-feature adaptive learning rates (features that appear often, with large accumulated gradients, get smaller steps; rare features get larger ones)
- importance-aware updates (an example with importance weight k has the same effect as presenting that example k times)
The upshot is that there's no need to pre-normalize or scale different features to make the learner less biased and more effective.
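To make the adaptive/normalized idea concrete, here is a minimal sketch in my own words, not VW's actual implementation: features are rescaled by a running estimate of their magnitude, and an AdaGrad-style per-feature learning rate is applied on top. The function name, step size, and constants are all illustrative.

```python
import numpy as np

def adaptive_normalized_sgd(X, y, eta=0.5, n_passes=3):
    """Toy online linear regression with per-feature adaptive learning
    rates (AdaGrad-style) plus online scale normalization.
    A simplified sketch, not Vowpal Wabbit's exact update rule."""
    n, d = X.shape
    w = np.zeros(d)          # weights live in the normalized space
    g2 = np.zeros(d)         # AdaGrad accumulator of squared gradients
    s = np.full(d, 1e-8)     # running per-feature scale (max |value| seen)
    for _ in range(n_passes):
        for x, target in zip(X, y):
            s = np.maximum(s, np.abs(x))   # track feature scale online
            xn = x / s                     # normalized features
            g = (w @ xn - target) * xn     # squared-loss gradient
            g2 += g * g
            w -= eta * g / np.sqrt(g2 + 1e-12)  # adaptive per-feature step
    # to predict in the original feature space, use (w / s) @ x
    return w, s
```

Even with one feature a hundred times larger than the other, no manual pre-scaling is needed, which is exactly the point made above.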
In addition, Vowpal Wabbit also implements online regularization via truncated gradient descent, with the regularization options:
- --l1 (L1-norm)
- --l2 (L2-norm)
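The truncated-gradient idea behind the L1 option can be sketched as an ordinary SGD step followed by a bounded shrink of each weight toward zero; this is a rough sketch of the Langford, Li & Zhang scheme, with hypothetical parameter names:

```python
import numpy as np

def truncated_gradient_step(w, g, eta, l1, K=1, t=1):
    """One online step of truncated gradient (sketch): a plain SGD step,
    then every K steps shrink each weight toward zero by at most
    eta * l1 * K, truncating at zero instead of crossing it."""
    w = w - eta * g                      # ordinary gradient step
    if t % K == 0:
        shrink = eta * l1 * K
        # soft-threshold: move toward zero, but never past it
        w = np.sign(w) * np.maximum(np.abs(w) - shrink, 0.0)
    return w
```

Small weights are driven exactly to zero, which is how sparsity is produced online.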
My experience with these enhancements on multiple data sets was that each of them, as it was introduced into the code, significantly improved model accuracy and smoothed convergence.
See the academic papers cited above for more detail on these enhancements.
There are three main solutions that come to mind, two of which are highly related:
Non-stationary Distributions: Online learning
This is a great case for time-series learning with a non-stationary distribution. This is the cutting edge and there are only a handful of people even studying this. Here is a link to the most recent research on the topic:
http://www.cs.nyu.edu/~mohri/pub/nonstat.pdf
The main takeaway here is that if your test distribution is very different from your training distribution, then performance will be poor; how poor is a function of how quickly your distributions change. SVMs make distributional assumptions that are inappropriate for this task: namely, that there is a stationary distribution over time. The best solution is to use an online learner, which doesn't make assumptions about the distribution. Here are some slides that go over online learning in detail: http://www.cims.nyu.edu/~mohri/amls/lecture_4.pdf
Online learning uses the notion of regret over "experts" in order to make predictions. The expert weights change over time as the algorithm makes mistakes. In this case, you could have an "expert" for each feature and each combination of features. You then hand the algorithm data points one after the other and update the weights of each expert in hindsight. This will allow you to follow changes in your distribution very easily.
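A minimal sketch of that experts idea is the exponentially weighted forecaster: predict with a weighted vote, then multiplicatively down-weight each expert by its loss. Names and the learning rate eta here are mine; the algorithms in the slides differ in detail.

```python
import numpy as np

def exponential_weights(expert_preds, outcomes, eta=0.5):
    """Exponentially weighted average forecaster over 'experts'.
    Each round: predict a weighted vote, observe the outcome, and
    down-weight experts in proportion to their loss.
    expert_preds has shape (T, n_experts); outcomes has shape (T,)."""
    T, n = expert_preds.shape
    w = np.ones(n)
    preds = []
    for t in range(T):
        p = w / w.sum()
        preds.append(float(p @ expert_preds[t]))   # weighted vote
        loss = np.abs(expert_preds[t] - outcomes[t])
        w *= np.exp(-eta * loss)                   # punish wrong experts
    return np.array(preds), w
```

Because losing experts decay exponentially, the forecaster tracks whichever expert is currently best, which is what lets it follow a drifting distribution.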
Domain Adaptation
Depending on your features, this might be a good place to use domain adaptation. Loosely, domain adaptation is used when the features in your training set are distributed differently from the features in your test set. It requires you to use the unlabeled test data to put weights on your training data. In other words, if $f_1$ occurs often in your training set but only very sparsely in your test set, then you might want to lower the weight of $f_1$. This might be the case if some of the SVM features are outperforming the model. For an extensive review: http://www.cs.nyu.edu/~mohri/pub/nsmooth.pdf (the review covers regression, but can easily be adapted to multi-class classification).
You will have to look at your data, but do you find that the features themselves change over time (or the distribution of those features, rather)? If so, you can use something like importance weighting, where the training features are weighted based on their likelihood of occurring in the test set. For further discussion on the topic: http://www.cs.nyu.edu/~rostami/papers/nadap.pdf and http://papers.nips.cc/paper/4156-learning-bounds-for-importance-weighting.pdf
Note: These are fairly advanced papers - this is at the cutting edge of machine learning research, so you will have to make the algorithms yourself. Luckily there is pseudocode available in the papers.
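As a toy illustration of importance weighting (deliberately much cruder than the estimators in those papers), you can estimate the density ratio p_test(x)/p_train(x) with simple histograms on a single feature and use it to reweight training points; everything here, including the bin count, is an illustrative choice:

```python
import numpy as np

def importance_weights(train_x, test_x, bins=10):
    """Crude density-ratio estimate p_test(x) / p_train(x) via shared-edge
    histograms on a 1-D feature. Training points that look like the test
    set get weights > 1; points unlikely under the test set get ~0."""
    lo = min(train_x.min(), test_x.min())
    hi = max(train_x.max(), test_x.max())
    edges = np.linspace(lo, hi, bins + 1)
    p_tr, _ = np.histogram(train_x, bins=edges, density=True)
    p_te, _ = np.histogram(test_x, bins=edges, density=True)
    # look up each training point's bin and take the ratio there
    idx = np.clip(np.digitize(train_x, edges) - 1, 0, bins - 1)
    return p_te[idx] / (p_tr[idx] + 1e-12)
```

These weights would then be passed to any learner that accepts per-example weights during training.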
Reinforcement Learning
The way you posed the problem made me immediately think of reinforcement learning. In this case, you want to learn the likelihood of an action based on the features. This could be constructed using a Markov Decision Process. This is valuable because it can quickly learn despite distributional changes. In fact, the learning rate $\alpha$ can determine the amount of "memory" in your learner.
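The role of $\alpha$ as "memory" can be seen in the basic recency-weighted value update, shown here as a toy sketch rather than a full Markov Decision Process:

```python
def recency_weighted_value(rewards, alpha=0.3):
    """Exponential recency-weighted average, the basic TD-style update
    Q <- Q + alpha * (r - Q). A larger alpha means a shorter memory,
    so the estimate tracks a drifting reward distribution faster."""
    q = 0.0
    for r in rewards:
        q += alpha * (r - q)   # move the estimate toward the new reward
    return q
```

If the reward distribution shifts mid-stream, a large alpha forgets the stale past quickly, which is exactly the robustness to distributional change mentioned above.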
For a comprehensive discussion and extensive literature review, look at these slides: http://www.cs.nyu.edu/~mohri/mls/lecture_11.pdf
I hope this helps, let me know if any of these sound like they might apply to your case and I can elaborate.
Best Answer
The streaming setting in machine learning is called "online learning". There is no exact support vector machine in the online setting (since the definition of the objective function is inherently for the batch setting). Probably the most straightforward generalization of the SVM to the online setting are passive-aggressive algorithms. Code is here http://webee.technion.ac.il/people/koby/code-index.html and an associated paper is here http://eprints.pascal-network.org/archive/00002147/01/CrammerDeKeShSi06.pdf
The basic idea is that one receives data as $(\mathbf{x},y)\in\mathbb{R}^d\times [k]$ pairs, with query points $\mathbf{x}\in \mathbb{R}^d$, where $k$ is the number of labels. The algorithm maintains a weight matrix $W_t\in \mathbb{R}^{k\times d}$. At iteration $t$ the algorithm receives a data point $\mathbf{x}_t$, produces predicted scores $\hat{\mathbf{y}}_t=W_t\mathbf{x}_t$ for each of the labels, and predicts the highest-scoring label as the true label. If the prediction is wrong, the algorithm makes the smallest change to $W_t$ such that it will avoid that mistake in the future, where "smallest change" is defined in terms of the Frobenius norm.
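A minimal sketch of one multiclass passive-aggressive update along the lines described above (simplified; see the linked paper and code for the real variants and the aggressiveness-bounded versions):

```python
import numpy as np

def pa_multiclass_update(W, x, y):
    """One multiclass passive-aggressive step (sketch). Scores are W @ x;
    if the margin between true label y and the best wrong label is < 1,
    shift those two rows by the smallest amount (Frobenius norm) that
    restores the margin."""
    scores = W @ x
    # best-scoring wrong label
    wrong = np.argmax(np.where(np.arange(len(scores)) == y, -np.inf, scores))
    loss = max(0.0, 1.0 - (scores[y] - scores[wrong]))
    if loss > 0:
        tau = loss / (2.0 * (x @ x) + 1e-12)   # minimal step size
        W = W.copy()
        W[y] += tau * x       # pull the true label's row toward x
        W[wrong] -= tau * x   # push the offending row away from x
    return W
```

The "passive" part is that nothing changes when the margin is already satisfied; the "aggressive" part is that a violated margin is fixed exactly, in a single step.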