Solved – Learning user behavior that changes over time

machine-learning, svm, weighted-sampling

I am learning a model using SVM that will predict user behavior of some kind.

Simplifying this model, each example in the feature space contains some features: $f_1,f_2,..,f_n$ and a class $a$ that represents the action that the user eventually took in that example (which I will try to predict).

I have the examples ordered by the chronological order of their respective actions.

I ran two experiments (on a pretty large set of users, learning one model per user):

  1. Use the first 50% of examples (chronologically) for training the SVM and the other 50% for testing it. The result: some individual features outperform the SVM model (!!!)
  2. Use a randomly chosen 50% for training the SVM and the rest for testing. The SVM model outperforms all individual features, as expected.

My conclusion is that user behavior changes over time (the examples were collected over several years), which means the first 50% are generated from a somewhat different distribution than the last 50%, and that causes the very poor model accuracy.
In practice I am still interested in the first setup rather than the second, because in reality I will only have past examples for training and will try to predict future examples, so I need to improve my learning process.

Is there any known solution or common approach to this problem? Some model or algorithm that gives more weight to newer examples and might fix or mitigate the problem I'm having?

Best Answer

There are three main solutions that come to mind, two of which are highly related:

Non-stationary Distributions: Online Learning. This is a great case for time-series learning with a non-stationary distribution. This is cutting-edge work, and only a handful of people are studying it. Here is a link to recent research on the topic: http://www.cs.nyu.edu/~mohri/pub/nonstat.pdf

The main takeaway here is that if your test distribution is very different from your training distribution, performance will be bad; how bad is a function of how quickly your distribution changes. SVMs make distributional assumptions that are inappropriate for this task: namely, they assume a stationary distribution over time. The best solution is to use an online learner, which makes no such assumption about the distribution. Here are some slides that go over online learning in detail: http://www.cims.nyu.edu/~mohri/amls/lecture_4.pdf

Online learning uses the notion of regret over "experts" in order to make predictions. The expert weights change over time as the algorithm makes mistakes. In this case, you could have an "expert" for each feature and each combination of features. You then feed the algorithm data points one after the other and update the weights of the experts in hindsight. This allows you to very easily follow changes in your distribution.
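To make the expert-weighting idea concrete, here is a minimal sketch of an exponentially weighted forecaster (the Hedge-style multiplicative update that the slides discuss); the function name and the toy data are mine, not from the slides:

```python
import numpy as np

def hedge_predictions(expert_preds, labels, eta=0.5):
    """Exponentially weighted forecaster over a pool of 'experts'.

    expert_preds: (T, K) array of 0/1 predictions from K experts.
    labels:       (T,) array of true 0/1 labels, revealed after each round.
    Returns the learner's predictions and the final (normalized) weights.
    """
    T, K = expert_preds.shape
    weights = np.ones(K)
    learner = np.empty(T, dtype=int)
    for t in range(T):
        # Predict by weighted vote of the experts.
        vote = weights @ expert_preds[t] / weights.sum()
        learner[t] = int(vote >= 0.5)
        # Update in hindsight: shrink the weights of experts that erred.
        losses = (expert_preds[t] != labels[t]).astype(float)
        weights = weights * np.exp(-eta * losses)
    return learner, weights / weights.sum()

# Toy run: expert 0 is always right, expert 1 is always wrong.
labels = np.array([1, 0, 1, 1, 0, 1])
experts = np.column_stack([labels, 1 - labels])
learner, final_w = hedge_predictions(experts, labels)
```

Because the update is multiplicative, the weight of a consistently wrong expert decays exponentially, which is exactly what lets the learner track a drifting distribution.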

Domain Adaptation. Depending on your features, this might be a good place to use domain adaptation. Loosely, domain adaptation is used when the distribution of features in your training set differs from that in your test set. It requires you to use the unlabeled test data to put weights on your training data. In other words, if $f_1$ occurs often in your training set but only very sparsely in your test set, you might want to lower the weight of $f_1$. This might explain why some individual features outperform the SVM model. For an extensive review: http://www.cs.nyu.edu/~mohri/pub/nsmooth.pdf (this review covers regression, but it can easily be adapted to multi-class classification).

You will have to look at your data: do the features themselves change over time (or rather, the distribution of those features)? If so, you can use importance weighting, where training examples are weighted by how likely their features are to occur in the test set. For further discussion on the topic: http://www.cs.nyu.edu/~rostami/papers/nadap.pdf and http://papers.nips.cc/paper/4156-learning-bounds-for-importance-weighting.pdf
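As a hedged sketch of importance weighting under covariate shift (the "domain classifier" density-ratio trick and the synthetic data are my illustration, not from the papers), scikit-learn's `SVC` accepts per-example `sample_weight`, so you can estimate $w(x) \approx p_{\text{test}}(x)/p_{\text{train}}(x)$ and plug it straight into SVM training:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic covariate shift: training features centered at 0, test at (1, 1).
X_train = rng.normal(0.0, 1.0, size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(1.0, 1.0, size=(200, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

# Estimate importance weights w(x) ~ p_test(x) / p_train(x) by training a
# "domain classifier" to distinguish training points from test points.
domain_X = np.vstack([X_train, X_test])
domain_y = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
dom = LogisticRegression().fit(domain_X, domain_y)
p_test = dom.predict_proba(X_train)[:, 1]
weights = p_test / (1.0 - p_test)

# SVC accepts per-example weights, so training points that resemble the
# test region count more during training.
svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train, sample_weight=weights)
acc = svm.score(X_test, y_test)
```

In your setting the "test set" would be the most recent (unlabeled) examples, so older training examples that no longer look like current behavior get down-weighted.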

Note: these are fairly advanced papers at the cutting edge of machine learning research, so you will have to implement the algorithms yourself. Luckily, pseudocode is available in the papers.

Reinforcement Learning. The way you posed the problem immediately made me think of reinforcement learning. Here, you want to learn the likelihood of an action given the features, which could be modeled as a Markov Decision Process. This is valuable because it can learn quickly despite distributional changes; in fact, the learning rate $\alpha$ determines the amount of "memory" in your learner. For a comprehensive discussion and an extensive literature review, see these slides: http://www.cs.nyu.edu/~mohri/mls/lecture_11.pdf
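To illustrate how $\alpha$ acts as "memory" (a toy sketch of mine, not from the slides), consider the standard constant-step-size value update applied to a reward signal that shifts partway through, like a user whose behavior changes:

```python
def track_value(rewards, alpha):
    """Constant-step-size update: Q <- Q + alpha * (r - Q).

    alpha plays the role of the RL learning rate: a large alpha gives a
    short memory that tracks recent rewards, while a small alpha averages
    over a long history and adapts slowly.
    """
    q = 0.0
    for r in rewards:
        q += alpha * (r - q)
    return q

# Non-stationary signal: the "user" switches behavior halfway through.
rewards = [0.0] * 50 + [1.0] * 50
fast = track_value(rewards, alpha=0.5)   # tracks the change almost immediately
slow = track_value(rewards, alpha=0.01)  # still dominated by the old regime
```

With $\alpha = 0.5$ the estimate converges to the new regime within a few steps, while $\alpha = 0.01$ remains pulled toward the stale pre-shift behavior, which is exactly the trade-off you would tune for drifting users.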

I hope this helps. Let me know if any of these sound like they might apply to your case, and I can elaborate.