Performance in training set worse than in test set

glmnet · machine learning · supervised learning

I have a high-dimensional regression problem that I tackled with glmnet, using a nested CV scheme. In the inner CV loop (10×5-fold) a grid search finds the optimal hyperparameters for glmnet. The outer CV loop (10×5-fold) is then used to estimate the performance of the regression with the previously found optimal parameters. However, when I apply this scheme something strange happens: the RMSE in the outer loop is considerably better than in the inner loop. This is counterintuitive, because one would expect the model to do well in the inner loop, where it was tuned directly on those data, and worse in the outer loop, where it has never seen the data beforehand. Does anyone have an explanation? My data set has 172 instances in total and 474 variables.
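For reference, here is a minimal sketch of the scheme described above, using scikit-learn's `ElasticNet` as a stand-in for glmnet and synthetic data of the same shape; the hyperparameter grid is purely illustrative, not the grid actually used.

```python
# Minimal sketch of the nested CV scheme: 10x5-fold inner grid search,
# 10x5-fold outer performance estimation. ElasticNet stands in for glmnet;
# the data are synthetic and the hyperparameter grid is illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score

X, y = make_regression(n_samples=172, n_features=474, n_informative=20,
                       noise=10.0, random_state=0)

inner_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)
outer_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=2)

param_grid = {"alpha": np.logspace(-3, 1, 9), "l1_ratio": [0.1, 0.5, 0.9, 1.0]}
grid = GridSearchCV(ElasticNet(max_iter=10000), param_grid,
                    scoring="neg_root_mean_squared_error", cv=inner_cv)

# Each outer fold refits the grid search on the outer training data and
# evaluates the tuned model on the held-out outer fold.
outer_scores = cross_val_score(grid, X, y,
                               scoring="neg_root_mean_squared_error", cv=outer_cv)
print(f"outer-loop RMSE: {-outer_scores.mean():.2f} +/- {outer_scores.std():.2f}")
```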

Best Answer

You probably have the learning curve working to your advantage, i.e. you can train better models given more data. Even though you are optimizing hyperparameters, the performance estimated in the inner CV can be consistently worse, because after the inner CV the 20% of the outer training fold that was held out in each inner split is put back into the training set for the model you evaluate in the outer CV. Whether this matters depends on whether you have already reached the number of data instances at which the learning curve flattens out. You appear to be in a region where it is still rising steeply.
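One way to check this empirically is to plot cross-validated RMSE against training-set size. The sketch below uses scikit-learn's `learning_curve` with an elastic-net stand-in and synthetic data; the penalty value and the training-size grid are assumptions for illustration only.

```python
# Rough learning-curve check: cross-validated RMSE as a function of the number
# of training instances. If RMSE is still dropping near the full sample size,
# the extra data available to the outer-loop models explains the better outer RMSE.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, learning_curve

X, y = make_regression(n_samples=172, n_features=474, n_informative=20,
                       noise=10.0, random_state=0)

train_sizes, _, val_scores = learning_curve(
    ElasticNet(alpha=0.1, max_iter=10000), X, y,
    train_sizes=np.linspace(0.4, 1.0, 5),
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error")

for n, rmse in zip(train_sizes, -val_scores.mean(axis=1)):
    print(f"{n:4d} training instances -> CV RMSE {rmse:.2f}")
```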

You will probably get a smaller difference if you use 5×10-fold CV in the inner procedure, since each inner training set then contains 90% rather than 80% of the outer training data and is therefore closer in size to the outer training fold.

It should be noted that some parameters you obtain in this way might be suboptimal. Regularization parameters, for instance, are affected by sample size. If your data set is small (and yours is), you may misestimate the required amount of regularization, because a 20% difference in sample size can be considerable. This can be remedied by using more CV folds in the inner procedure (i.e. increasing $k$ in $k$-fold CV).
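As a rough illustration of that last point (again with synthetic data and scikit-learn's `ElasticNetCV` as a stand-in for cv.glmnet; the sample-size fractions are assumptions), you can rerun the penalty selection on nested subsets and watch the chosen regularization strength shift with sample size:

```python
# Illustration of how the selected regularization strength can depend on how
# much data the inner CV trains on; larger inner training sets (larger k)
# give penalty estimates closer to what the outer-loop model actually needs.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=172, n_features=474, n_informative=20,
                       noise=10.0, random_state=0)

# ~64%, 80% and 100% of the data: roughly the inner-CV training size, the
# outer-CV training-fold size and the full data set in a 5-fold/5-fold scheme.
for frac in (0.64, 0.80, 1.00):
    n = int(frac * len(y))
    model = ElasticNetCV(l1_ratio=0.5, cv=10, max_iter=10000).fit(X[:n], y[:n])
    print(f"{n:3d} instances -> selected alpha {model.alpha_:.4f}")
```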
