Solved – Performance drop between training and validation datasets

boosting, validation

I have been using R's GBM (Gradient Boosting Machine) package for several months. I typically split my data into three partitions: training, validation, and testing. I use the validation set to pick the optimal number of iterations, and the test set is a completely untainted data set used for nothing other than final reporting.
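For concreteness, here is a minimal sketch of that workflow, assuming a pre-shuffled data frame `dat` with a 0/1 target `y` (both hypothetical names); gbm's `train.fraction` argument carves off the validation portion itself:

    library(gbm)

    ## Hypothetical: `dat` is a pre-shuffled data frame with a 0/1 target `y`.
    ## Hold out the first third as the untouched test set.
    n    <- nrow(dat)
    cut  <- floor(n / 3)
    test <- dat[1:cut, ]
    rest <- dat[-(1:cut), ]

    ## train.fraction makes gbm fit on the first half of `rest` and
    ## track held-out (validation) deviance on the remainder.
    fit <- gbm(y ~ ., data = rest,
               distribution      = "bernoulli",
               n.trees           = 5000,
               shrinkage         = 0.01,
               interaction.depth = 3,
               train.fraction    = 0.5)

    ## Iteration that minimises the validation-set deviance.
    best.iter <- gbm.perf(fit, method = "test")

    ## Final, untainted estimate from the test set only.
    p.test <- predict(fit, test, n.trees = best.iter, type = "response")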

I have noticed, however, that the mean error on the validation set at the optimal number of trees is often considerably higher than the mean error on the training set.

My question is: Do I care that the training and validation error are drastically different? Or do I only care that the validation and testing error are close?

The old guard in my office is convinced that the training and validation error must be similar, otherwise the model will not generalize well. For an algorithm like GBM, which can fit the training data perfectly given enough iterations, I believe the real assessment of generalization is between the validation and test data sets.

EDIT #1:

I am usually training a model to predict a binary outcome, so the error measure is binomial deviance. My data sets are large enough that sample size shouldn't be an issue: I typically build on 100k records with ~200 features, split into thirds for the training, validation, and test sets. My target variable is often imbalanced, at about a 10/90 ratio or even more extreme.
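For reference, the error being compared is the mean binomial (Bernoulli) deviance, which can be computed from predicted probabilities as below (a small sketch; the clipping constant is my own guard against log(0), not part of gbm):

    ## Mean binomial deviance for labels y in {0,1} and predicted
    ## probabilities p; this is the loss gbm reports for
    ## distribution = "bernoulli".
    binomial.deviance <- function(y, p) {
      eps <- 1e-15                       # guard against log(0)
      p   <- pmin(pmax(p, eps), 1 - eps)
      -2 * mean(y * log(p) + (1 - y) * log(1 - p))
    }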

Best Answer

If the training error is a lot lower than the test set error, that is usually an indication of over-fitting, but at the end of the day, it is generalisation performance that really matters. Given a choice between a model that has a 0% error on the training set and 20% on the test set, and a model that has a training error of 20% and a test error of 21%, I'll use the former rather than the latter, provided the test set is large enough to be a reliable indicator of generalisation performance.

If you have a problem with more features than training cases, you can always construct a linear classifier that classifies every pattern in the training set without error (provided the points are in general position). In this case it is normal to use regularisation to obtain a classifier with a large margin (cf. the SVM), which improves generalisation. However, you will still end up with a training error of zero, even for a classifier that generalises well.
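A quick toy check of that claim, on simulated data: with more features than cases, even an unregularised linear fit interpolates labels that are pure noise.

    set.seed(1)
    n <- 20; p <- 50
    X <- matrix(rnorm(n * p), n, p)   # points in general position
    y <- rbinom(n, 1, 0.5)            # labels unrelated to the features
    fit  <- lm(y ~ X)                 # rank-deficient, yet fits y exactly
    yhat <- as.numeric(fitted(fit) > 0.5)
    mean(yhat != y)                   # training error: 0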

Essentially, the training error is potentially misleading, and I would recommend that most users ignore it entirely and concentrate on validation set performance (but be aware that it is possible to over-fit the validation set as well by making lots of adjustments to the model in order to improve validation set performance).
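That last point shows up clearly in a toy simulation: among many equally useless models (random coin-flip predictions), the one selected for its validation score looks much better on the validation set than it really is.

    set.seed(1)
    n.val <- 200; n.test <- 200; n.models <- 1000
    y.val  <- rbinom(n.val,  1, 0.5)
    y.test <- rbinom(n.test, 1, 0.5)

    ## Each "model" just guesses at random on both sets.
    val.err  <- replicate(n.models, mean(rbinom(n.val,  1, 0.5) != y.val))
    test.err <- replicate(n.models, mean(rbinom(n.test, 1, 0.5) != y.test))

    best <- which.min(val.err)        # pick the best-looking model
    val.err[best]                     # well below 0.5: a selection artefact
    test.err[best]                    # near 0.5: true chance-level skill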

Have you tried using a support vector machine (or some other regularised machine learning method)? These are relatively easy to optimise, as there are only a few regularisation and kernel parameters to tune. GBM looks to me a little trickier to tune (e.g. what value of the regularisation parameter to use).
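If you want to try that, here is a minimal sketch with the e1071 package, assuming a hypothetical data frame `train` whose factor column `y` is the class label; `tune()` grid-searches the two SVM hyper-parameters by cross-validation:

    library(e1071)

    ## Cross-validate over a small grid of cost and RBF-kernel gamma.
    tuned <- tune(svm, y ~ ., data = train,
                  ranges = list(cost  = 2^(-2:6),
                                gamma = 2^(-6:0)))
    summary(tuned)
    best.svm <- tuned$best.model      # refit on the full training data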