I understand that most algorithms are optimized to minimize the training error, but why is the test error usually larger than the training error? Is there a statistical reason why?
Solved – Why does the training error usually underestimate the test error
machine-learning, train
Related Solutions
First, I think you're mistaken about what the three partitions do. You don't make any choices based on the test data. Your algorithms adjust their parameters based on the training data. You then run them on the validation data to compare your algorithms (and their trained parameters) and decide on a winner. You then run the winner on your test data to give you a forecast of how well it will do in the real world.
You don't validate on the training data because that comparison would simply favor the models that overfit it. You don't report the winner's validation score as your final estimate because you've been iteratively adjusting things to get a winner in the validation step, so you need an independent test set (one you haven't specifically been adjusting towards) to give you an idea of how well you'll do outside of the current arena.
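To make that workflow concrete, here is a minimal sketch of a three-way split using scikit-learn. The synthetic data, the two candidate models, and the 60/20/20 proportions are my own illustrative assumptions, not anything prescribed above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Illustrative data; substitute your own X, y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Fit each candidate on the training set only.
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for model in candidates.values():
    model.fit(X_train, y_train)

# Compare candidates on the validation set and pick a winner.
val_scores = {name: accuracy_score(y_val, m.predict(X_val)) for name, m in candidates.items()}
winner = max(val_scores, key=val_scores.get)

# The test set is touched exactly once, to forecast real-world performance.
test_score = accuracy_score(y_test, candidates[winner].predict(X_test))
print(winner, val_scores[winner], test_score)
```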
Second, I would think that one limiting factor here is how much data you have. Most of the time, we don't even want to split the data into fixed partitions at all, hence cross-validation (CV).
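For completeness, a small cross-validation sketch (again with illustrative data of my own choosing): every observation gets used for both fitting and scoring, just never within the same fold, so no data is permanently set aside in a fixed partition.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Illustrative data, as in the previous sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# 5-fold cross-validation: each fold serves once as the held-out set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```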
Say the data is generated by some underlying distribution $f$. We want to learn a model that performs well on future data generated by the same distribution. The true generalization performance of a model is the expected value of the error over $f$. Unfortunately, we only have access to a finite dataset sampled from $f$, which must be used both to train the model (including hyperparameters) and to estimate generalization performance. Finite samples are variable: if we were to draw multiple datasets from $f$, each one would be different, and none would perfectly represent the underlying distribution.
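In symbols (my notation, not the answer's): writing $\hat m$ for the fitted model and $L$ for the loss, the quantity we care about is

$$\operatorname{Err}(\hat m) \;=\; \mathbb{E}_{(x,\,y)\sim f}\!\left[L\!\left(y,\hat m(x)\right)\right],$$

while all we can actually compute is the average of $L$ over a finite sample drawn from $f$.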
Suppose we split the dataset into a training set (used to train the model parameters) and a validation set (used to select the hyperparameters). By tuning the hyperparameters on the validation set, we're selecting a model (out of multiple possibilities) that performs best on that particular sample. In that sense, this operation is not fundamentally different from choosing regular parameters using the training set. Just as it's possible to overfit the training set, it's possible to overfit the validation set. Because samples are variable, a model may have low error on the validation set because it happens to be a good match for the particular, random values in that sample, rather than because it truly matches the underlying distribution $f$. In this case, the model's error on the validation set will be lower than its expected error over $f$. The chance of finding such a model increases as the number of models we select from grows, and as the size of the validation set shrinks.
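A small simulation can make this concrete. The sketch below is my own illustration, with arbitrary sample sizes: every candidate "model" is pure noise (true accuracy exactly 0.5), yet the best of 200 candidates on a 50-point validation set looks far better than chance, and the gap vanishes on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_val, n_test, n_models = 50, 100_000, 200

# Binary labels with no learnable structure.
y_val = rng.integers(0, 2, size=n_val)
y_test = rng.integers(0, 2, size=n_test)

# Each candidate "model" just guesses labels at random.
val_preds = rng.integers(0, 2, size=(n_models, n_val))
test_preds = rng.integers(0, 2, size=(n_models, n_test))

# Pick the candidate with the best validation accuracy.
val_acc = (val_preds == y_val).mean(axis=1)
best = np.argmax(val_acc)

print("best validation accuracy:", val_acc[best])                        # typically well above 0.5
print("same model on fresh data:", (test_preds[best] == y_test).mean())  # close to 0.5
```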
The remedy to this issue is to estimate generalization performance on an independent subset of the data that has not affected the model in any way (including regular parameters, hyperparameters, or even preprocessing and decisions by the analyst). For example, the data can be split into independent training, validation, and test sets, or nested cross-validation can be used.
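As one possible illustration of the second option, here is a minimal nested cross-validation sketch with scikit-learn; the estimator, parameter grid, and fold counts are assumptions chosen only for demonstration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Illustrative data; substitute your own X, y.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Inner loop: hyperparameter selection (plays the role of the validation set).
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=3)

# Outer loop: performance estimation on folds never seen by the inner search
# (plays the role of the test set).
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```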
For more information about these issues, see:
Cawley, G. C. and Talbot, N. L. C. (2010). On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11, 2079-2107.
Best Answer
Training and testing data are not identical.
As you yourself point out, most training optimizes the model's performance on the training set; naturally, that performance will tend to be worse on a different set of data.
Consider a really simple case: two samples (a training sample and a test sample) from one population, and a "model" that simply predicts the training-set sample mean. That mean is, by construction, the constant closest (in the mean squared error sense) to the training observations, while its mean squared error on the test set includes an additional term related to the square of the difference between the two sample means.
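Spelling that out (my own working, with $x_1,\dots,x_n$ the training observations, $y_1,\dots,y_m$ the test observations, and $\bar{x},\bar{y}$ their sample means): for any constant $c$,

$$\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 \;\le\; \frac{1}{n}\sum_{i=1}^{n}(x_i-c)^2,$$

so the training error is as small as it can possibly be at $c=\bar{x}$, while on the test set

$$\frac{1}{m}\sum_{j=1}^{m}(y_j-\bar{x})^2 \;=\; \frac{1}{m}\sum_{j=1}^{m}(y_j-\bar{y})^2 \;+\; (\bar{y}-\bar{x})^2.$$

The test error therefore picks up the extra non-negative term $(\bar{y}-\bar{x})^2$, which is zero only if the two sample means happen to coincide.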