Solved – Why is the training RMSE very small but the test error very large?

gaussian process, overfitting, regression

I am a beginner in machine learning, and I use MATLAB's machine learning toolbox to build my models.

I have a question about my model. My input data has 20 predictors and 1 response.

I use a Gaussian process regression model with a squared exponential kernel function, and I also use 10-fold cross-validation.

With this setup the training RMSE is very small. However, when I export the model and use it to predict on some test data, the test error is very large.

How can I improve my model? I would appreciate any answers.

Best Answer

I think the problem is that the model you created is overfitting.

When you build a predictive model, the goal is a model that captures the signal in the data, not the noise.

The training RMSE measures how closely the model reproduces the training data, signal and noise together, so a very small training RMSE can simply mean that the model has fitted the noise as well.

When you add more variables to a model, the model becomes more "flexible": it captures the patterns in the training data very well and drives the training RMSE down. However, a statistical learning procedure that works too hard to find patterns in the training data may pick up patterns that are caused by random chance rather than by true properties of the unknown function f that you are trying to estimate. When you overfit the training data, the test MSE will be very large because the supposed patterns that the method found in the training data simply don't exist in the test data.
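To make the train/test gap concrete, here is a minimal sketch in Python (not MATLAB, and not the asker's actual model). It fits a small Gaussian process with a squared exponential kernel to synthetic 1-D data twice: once with a very short length scale and almost no noise term, which makes the model flexible enough to reproduce every training point, and once with smoother settings. The data, the length scales, the noise level, and all function names are assumptions chosen for illustration; here the kernel settings, rather than the number of variables, act as the flexibility knob.

```python
import math
import random


def sq_exp_kernel(a, b, length_scale):
    """Squared exponential kernel for scalar inputs (signal variance 1)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward and then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(x_train, y_train, x_new, length_scale, noise_var):
    """Posterior mean of a zero-mean GP at the points in x_new."""
    n = len(x_train)
    K = [[sq_exp_kernel(x_train[i], x_train[j], length_scale)
          + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve_cholesky(cholesky(K), y_train)
    return [sum(sq_exp_kernel(x, xt, length_scale) * a
                for xt, a in zip(x_train, alpha)) for x in x_new]


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Synthetic data (an assumption for this sketch): y = sin(x) + Gaussian noise.
rng = random.Random(0)
NOISE_SD = 0.3
x_train = [6.0 * i / 29 for i in range(30)]
y_train = [math.sin(x) + rng.gauss(0.0, NOISE_SD) for x in x_train]
x_test = [rng.uniform(0.0, 6.0) for _ in range(200)]
y_test = [math.sin(x) + rng.gauss(0.0, NOISE_SD) for x in x_test]

# (length_scale, noise_var) pairs; both settings are illustrative guesses.
settings = {
    "overfit": (0.05, 1e-6),            # very wiggly, almost no noise allowed
    "sensible": (1.0, NOISE_SD ** 2),   # smooth, noise level matches the data
}

results = {}
for name, (length_scale, noise_var) in settings.items():
    train_pred = gp_predict(x_train, y_train, x_train, length_scale, noise_var)
    test_pred = gp_predict(x_train, y_train, x_test, length_scale, noise_var)
    results[name] = (rmse(y_train, train_pred), rmse(y_test, test_pred))
    print(f"{name:9s} train RMSE = {results[name][0]:.4f}  "
          f"test RMSE = {results[name][1]:.4f}")
```

The useful habit this illustrates is to judge a model by its error on data it was not fitted to (held-out or cross-validated predictions), not by its training RMSE. A training error far below the noise level of the data is a warning sign, not a success.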

I recommend reading the Quora answer by William Chen at this link; he explains it very well in layman's terms: https://www.quora.com/What-is-an-intuitive-explanation-of-over-fitting-particularly-with-a-small-sample-set-What-are-you-essentially-doing-by-over-fitting-How-does-the-over-promise-of-a-high-R%C2%B2-low-standard-error-occur

After reading it, you can look at the free ebook An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.

edit: link to the ebook http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf

Read Chapter 2, sections 2.1 and 2.2; they explain it in great detail with excellent illustrations.

I hope this helps.