I am using a multi-layer perceptron neural network to try to predict the outcomes of football matches, using 20 years' worth of match results and statistics. I am using 10-fold cross-validation in Weka, and getting results such as 83% correctly classified instances for my chosen inputs and parameters. I then train the model on the entire data set.
But when I apply the model to predict real-world results of future matches (over the last 10 rounds), it achieves only 52% accuracy, barely better than chance.
In theory, if cross-validation yields 83% accuracy, shouldn't I expect the same model to achieve roughly the same accuracy on future predictions?
Edit: I also trained my network on 20 years' worth of match data up to and including 2016 (i.e. excluding any 2017 data). I then tested the resulting model on a separate test set comprising 2017 match data, which was not used at all in training, and the network achieved 82% accuracy. And yet in practice I have only achieved 52% prediction accuracy. I still don't understand that discrepancy.
Best Answer
You are using cross-validation incorrectly. You split all of your data randomly into 10 folds, which means every training fold contains matches from every year, so the model learns from 90% of each year's data. When it predicts, it predicts the remaining 10% of a year from which it has already seen the other 90%. But when you then predict future games, the classifier has, of course, seen no data at all from that year.
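To see the leak concretely, here is a small illustration in Python (the seasons and match counts below are made up, not your actual data): with a shuffled 10-fold split, every training fold ends up containing matches from every season.

    import numpy as np
    from sklearn.model_selection import KFold

    # Hypothetical data: 20 seasons (1997-2016), 100 matches per season,
    # stored in chronological order.
    seasons = np.repeat(np.arange(1997, 2017), 100)

    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    train_idx, test_idx = next(kf.split(seasons))

    # The training fold covers all 20 seasons, so every test match has
    # "neighbours" from its own season already seen during training.
    print(np.unique(seasons[train_idx]).size)  # 20
    print(np.unique(seasons[test_idx]).size)   # typically 20 as well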
Or think of it like this: suppose you want to predict the temperature on a certain day next year. If you use data from the last 20 years and split it randomly, it is of course easy for the classifier to predict day x in the test sample when it has already seen days x-1, x-2, and x+1 (the days around day x). It just learns to interpolate between neighbouring days. It never learns to predict a year ahead, i.e. to use the previous years to infer the temperature on the same day one year later, because the days immediately before and after day x are far more useful.
Now that the problem is clear, let's move on to the solution.
So how do you do unbiased cross-validation on a time series? Use forward chaining (also called rolling-origin evaluation): order the folds by time and only ever test on data that lies entirely after the training data, for example:

fold 1: train on [1], test on [2]
fold 2: train on [1 2], test on [3]
fold 3: train on [1 2 3], test on [4]
fold 4: train on [1 2 3 4], test on [5]
fold 5: train on [1 2 3 4 5], test on [6]

where each number stands for a block of consecutive seasons. The classifier is then always evaluated on the task it actually faces in practice: predicting matches that lie in its future.
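This scheme is built into scikit-learn as TimeSeriesSplit, so a minimal sketch looks like this (the features, labels, and network settings are placeholders, not your Weka setup):

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    # Placeholder data, ordered oldest match first; in practice X would be
    # your match statistics and y the result (e.g. 0=loss, 1=draw, 2=win).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 10))
    y = rng.integers(0, 3, size=2000)

    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
        # Every fold trains only on matches that precede the test matches.
        model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

    print("per-fold accuracy:", np.round(scores, 3))
    print("mean accuracy:", round(np.mean(scores), 3))

As a bonus, the per-fold scores show you whether performance drifts over time, which random 10-fold cross-validation hides completely.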
This should yield more realistic performance estimates and help you fight the overfitting in your current model.