Solved – High predicted R2 in training data but low accuracy on test data set

I'm building a regression model in Python to predict airline prices. I have three inputs (mean, median, and number of days from departure) and a dataset of more than 100,000 data points crawled at daily intervals. I took the log of the input variables and split the dataset 80:20 for training and testing. I'm getting a high predicted R2 (greater than 80 percent) on the training data. However, the accuracy (the share of predicted data points within 5 percent of the actual price) is poor on the 20 percent test set, i.e. 20,000 data points, both for the same routes as training on different departure dates and for totally unseen routes.
Why is there a mismatch between the predicted R2 and the accuracy? Do I need to compute the root mean squared error, instead of counting the predicted values within 5 percent of the observed values, to know the prediction accuracy?
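To make the three measures in the question concrete, here is a minimal sketch in plain Python. The function names and the toy prices are made up for illustration and are not the asker's data; the point is only that R2, RMSE, and the "within 5 percent" share measure different things and need not agree.

```python
import math


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the price."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def share_within(actual, predicted, tolerance=0.05):
    """Fraction of predictions whose relative error is at most `tolerance`."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(p - a) <= tolerance * abs(a))
    return hits / len(actual)


# Hypothetical prices spanning a wide range; every prediction is off by 10 percent.
actual = [1000.0, 2000.0, 4000.0, 8000.0, 16000.0]
predicted = [1100.0, 1800.0, 4400.0, 7200.0, 17600.0]

print("R2:", r_squared(actual, predicted))
print("RMSE:", rmse(actual, predicted))
print("within 5%:", share_within(actual, predicted))
```

On this toy data R2 comes out at about 0.98 while no prediction lands within 5 percent of the actual price. R2 is relative to the spread of the prices, so a wide price range can hide errors that are large in percentage terms. Computing all three on both the training and the test set shows whether the gap comes from the choice of metric or from the model failing to generalize.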

I've constructed an X-Y scatter of predicted vs actual. I reduced the data set to 510 data points and used 489 data points for training, 21 data points from the same route for testing, and another 63 data points from a different route for testing. This time the predicted R2 dropped to 60 percent on the training set. The results are as follows.

Graph 1 is where the test data is from the same route, i.e. 510 - 489 = 21 data points. Accuracy is horrible both in terms of
a) the number of predicted data points within 5 percent of observed (14.28 percent), and
b) the root mean squared error, which is 2979 and is huge.
Predicted vs Actual – Bombay – Calcutta – Test data points are same route as training.

Graph 2 is where the test data points are 63 in number for Ahmadabad – Calcutta which is a totally different route.

This time the accuracy is slightly improved, in terms of
a) the number of predicted data points within 5 percent of observed (33 percent), and
b) the root mean squared error, which is 2246 and is still huge.
Predicted vs Actual – Ahmadabad – Calcutta – Test data points are from a different route than training

Best Answer

This is classic overfitting. Your model has a high R2 on the training data because it has fitted non-predictive noise there, so it doesn't generalize well, hence the low accuracy on your test set. Try training on more features if you can find the data. Otherwise, consider trying different families of models and/or applying regularization.
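As a rough illustration of what applying regularization means, here is a minimal sketch of ridge (L2-penalized) regression for a single input, written in plain Python. The function name `fit_ridge_1d`, the penalty values, and the toy numbers are assumptions for illustration only; in practice you would more likely use a library implementation such as scikit-learn's `Ridge` on all of your features.

```python
def fit_ridge_1d(x, y, alpha):
    """Fit y ~ intercept + slope * x, penalizing slope**2 with weight alpha.

    alpha = 0 gives ordinary least squares; a larger alpha shrinks the slope
    toward zero. The intercept is not penalized.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / (sxx + alpha)
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical training data, not taken from the question.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [2.0, 4.0, 6.0, 8.0]

for alpha in (0.0, 5.0):
    intercept, slope = fit_ridge_1d(x_train, y_train, alpha)
    print(f"alpha={alpha}: intercept={intercept}, slope={slope}")
```

The penalty trades a slightly worse fit on the training data for coefficients that are less sensitive to noise. The usual way to pick `alpha` is to compare the error on held-out data for several values rather than the training R2.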
