Solved – Neural network prediction accuracy doesn’t match real world results

Tags: accuracy, cross-validation, neural networks, prediction

I am using a multi-layer perceptron neural network to try to predict the outcomes of football matches, using 20 years' worth of match results and statistics. I am using 10-fold cross-validation in Weka, and getting results such as 83% correctly classified instances for my chosen inputs and parameters. I then train the model on the entire data set.

But when I apply the model to predict the real-world results of future matches (over the last 10 rounds), it only achieves an accuracy of 52%, barely better than chance.

In theory, if the cross validation results in 83% accuracy, shouldn't I expect the same model to achieve roughly the same accuracy in future predictions?

Edit: I also trained my network on 20 years' worth of match data up to and including 2016 (i.e. excluding any 2017 data). I then tested the resulting model on a separate test set of 2017 matches, which was not used at all in training, and the network achieved 82% accuracy. Yet in practice I have only achieved 52% prediction accuracy. I still don't understand the discrepancy.

Best Answer

You are using cross-validation incorrectly. You simply split all your data into 10 folds, which means each training fold contains matches from every year, so the model learns from 90% of each year's data. When it predicts, it predicts the remaining 10% of a year of which it has already seen 90%. When you then predict future games, the classifier has, of course, not yet seen any data from that year.

Or think of it like this: suppose you want to predict the temperature on a certain day next year. If you use data from the last 20 years and split it randomly, it is of course easy for the classifier to predict day x in the test sample when it has already seen days x-1, x-2, and x+1 (the days around day x). So it merely learns to interpolate between the surrounding few days; those neighbours are far more informative than anything else. It never learns to predict a year ahead, i.e. to use the previous years to infer the temperature on the same day one year later.
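To make the leakage concrete, here is a minimal sketch (in Python with scikit-learn rather than your Weka setup; the season labels are invented) showing that a shuffled 10-fold split puts every season into both the training fold and the test fold:

```python
# Minimal sketch: a shuffled 10-fold split mixes every season into both folds,
# so the model always gets to "peek" at most of the season it is tested on.
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical season labels: 20 seasons of roughly 380 matches each.
seasons = np.repeat(np.arange(1998, 2018), 380)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
train_idx, test_idx = next(kf.split(seasons))

print("seasons in training fold:", np.unique(seasons[train_idx]).size)  # 20
print("seasons in test fold:    ", np.unique(seasons[test_idx]).size)   # 20
# All 20 seasons appear on both sides, so the evaluation never asks the model
# to predict a season it has not already largely seen.
```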

Now that the problem is clear, let's move on to the solution.

So how to do unbiased cross-validation on time series:

  • Do NOT use the same year in both training and testing.
  • Do not train on data newer than the data you are predicting either, because that is not a realistic use case...
  • Take, say, the first 10 years, train, and predict year 11. Then train on the first 11 years, predict year 12, and so on (a sketch of this loop follows below). This will give you an estimate of how well the model can predict future outcomes, as well as how much it improves with more data. In the end, once you have optimized your network, you can train on the full data sample.
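Here is a rough sketch of that expanding-window loop, using a scikit-learn MLP as a stand-in for the Weka model; the file name matches.csv and the season/result/feature columns are assumptions for illustration:

```python
# Expanding-window ("rolling-origin") evaluation: train on seasons 1..k,
# test on season k+1. File and column names are hypothetical.
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

df = pd.read_csv("matches.csv")                      # assumed: one row per match
feature_cols = [c for c in df.columns if c not in ("season", "result")]
seasons = sorted(df["season"].unique())

scores = []
for i in range(10, len(seasons)):                    # start once 10 seasons are available
    train = df[df["season"].isin(seasons[:i])]
    test = df[df["season"] == seasons[i]]

    model = make_pipeline(
        StandardScaler(),                            # MLPs benefit from scaled inputs
        MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0),
    )
    model.fit(train[feature_cols], train["result"])

    acc = accuracy_score(test["result"], model.predict(test[feature_cols]))
    scores.append(acc)
    print(f"train {seasons[0]}-{seasons[i - 1]}, test {seasons[i]}: {acc:.3f}")

print("mean forward-validation accuracy:", sum(scores) / len(scores))
```

The per-season scores from a loop like this are a far better guide to what you can expect on upcoming rounds than the shuffled 10-fold figure.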

This should yield more realistic performance figures and help you fight the over-fitting and over-optimistic evaluation in your current setup.
