Time Series – How to Best Evaluate a Time Series Prediction Algorithm

machine-learning, prediction, predictive-models

What's best-practice for training and evaluating a prediction algorithm on a time series?

For learning algorithms that are trained in batch mode, a naive programmer might give the raw dataset of [(sample, expected prediction),...] directly to the algorithm's train() method. This will usually show an artificially high success rate because the algorithm will effectively be "cheating" by using future samples to inform predictions made on earlier samples. When you actually try to use the trained model to predict new data in real-time, it'll probably perform terribly, since it no longer has any future data to rely on.
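
To make the leakage concrete, here's a toy illustration (assuming scikit-learn is available; the synthetic series and the k-NN model are purely illustrative, not my actual setup). With a shuffled split, the model can borrow each test point's future neighbours; with a chronological split, it can't:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
t = np.arange(1000).reshape(-1, 1)    # time index as the only feature
y = 0.05 * t.ravel() + np.sin(t.ravel() / 20) + rng.normal(scale=0.1, size=1000)

# Shuffled split: future observations leak into the training set.
idx = rng.permutation(1000)
train, test = idx[:800], idx[800:]
leaky = KNeighborsRegressor(n_neighbors=3).fit(t[train], y[train])
print("shuffled-split R^2:", leaky.score(t[test], y[test]))     # typically near 1.0

# Chronological split: train only on the past, test only on the future.
train, test = np.arange(800), np.arange(800, 1000)
honest = KNeighborsRegressor(n_neighbors=3).fit(t[train], y[train])
print("chronological R^2:", honest.score(t[test], y[test]))     # far worse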

My current approach is to train and evaluate as you would in real time. For N training samples, ordered chronologically, where each sample is a tuple of the input A and the expected prediction output B, I feed A into my algorithm and get the actual result C. I compare C to B and record the error. Then I add the sample to the local "past" subset and batch-train a new model on just that subset. I repeat this process for each training sample.

Or, to put it in pseudo-code:

predictor = Predictor()
training_samples = []
errors = []
for sample in sorted(all_samples, key=lambda o: o.date):
    input_data, expected_prediction = sample

    # Test on the current time slice. (On the very first pass the predictor is
    # still untrained, so that first result may need to be discarded.)
    actual_prediction = predictor.predict(input_data)
    errors.append(expected_prediction != actual_prediction)  # record whether it was wrong

    # Re-train on all "past" samples relative to the current time slice.
    training_samples.append(sample)
    predictor = Predictor.train(training_samples)

This seems very thorough, since it simulates what a user would be forced to do if they had to make a prediction at each time step. But for any large dataset it would clearly be terribly slow, since it multiplies the algorithm's training time (which for many algorithms and large datasets is already high) by the number of samples.

Is there a better approach?

Best Answer

What you are proposing is known as a "rolling origin" evaluation in the forecasting literature. And yes, this method of evaluating forecasting algorithms is very widely used.

If you find that performance is a bottleneck, you could do subsampling. Don't use every possible origin. Instead, use, e.g., every fifth possible origin. (Make sure you don't introduce unwanted confounding between your subsampled origins and seasonality in the data. For instance, if you use daily data, don't use every seventh day as an origin, because then you would really only be assessing forecasting quality on Tuesdays, or only on Thursdays, and so on.)
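
For instance, here is a minimal sketch of this subsampled rolling-origin evaluation, reusing the hypothetical Predictor / all_samples names from the question's pseudo-code (the stride and min_history values are just placeholders):

stride = 5          # evaluate at only every fifth possible origin
min_history = 50    # require some history before the first forecast

samples = sorted(all_samples, key=lambda o: o.date)
errors = []
for cutoff in range(min_history, len(samples), stride):
    history = samples[:cutoff]                          # everything up to this origin
    input_data, expected_prediction = samples[cutoff]   # the next, still-unseen point

    predictor = Predictor.train(history)                # fit on the past only
    actual_prediction = predictor.predict(input_data)
    errors.append(expected_prediction != actual_prediction)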

Then again, you don't really need to train your model again from scratch every time you roll the origin forward. Start out from the last trained model. (For example, in Exponential Smoothing, simply update your components with the new data since the last training.) This should dramatically cut down on your overall training time.
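
For example, here is a minimal sketch of that idea for simple exponential smoothing (the alpha value and the data are purely illustrative; a real implementation would also re-estimate alpha from time to time):

alpha = 0.3

def update_level(level, observation, alpha):
    """One exponential-smoothing update; the new level is also the next forecast."""
    return alpha * observation + (1 - alpha) * level

series = [12.0, 13.5, 12.8, 14.1, 13.9, 15.2, 14.8]   # illustrative data
level = series[0]                                      # initialise on the first point
errors = []
for observation in series[1:]:
    forecast = level                                   # one-step-ahead forecast
    errors.append(observation - forecast)              # evaluate at this origin
    level = update_level(level, observation, alpha)    # O(1) update instead of a full refit

Each rolled-forward origin then costs a single constant-time update rather than a complete retraining pass.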
