Solved – Why is training error a better performance metric than cross-validation error

cross-validation, deep learning, model selection, optimization

What I have learned is that I should use cross-validation performance to select the best model. However, selecting models by cross-validation performance currently gives lower test performance than selecting them by training performance. I have 300-400 features and about 10,000 samples, with 12 output classes (multi-class classification).

I've split the dataset into 80:20 shares (training:test) and I'm using H2O's deep learning with 5-fold cross-validation.
Is it wrong to use the training log-loss instead of the cross-validation log-loss?
As I understand it, the five cross-validation holdout sets are used for scoring the models, and the XVAL log-loss is then, I believe, the average over the five models trained on the folds.
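
Roughly, my setup looks like the following minimal sketch with H2O's Python API (the file name, column name, and network size here are just placeholders, not my actual data):

```python
import h2o
from h2o.estimators import H2ODeepLearningEstimator

h2o.init()
data = h2o.import_file("data.csv")            # placeholder file name
data["label"] = data["label"].asfactor()      # 12-class target (placeholder column name)
features = [c for c in data.columns if c != "label"]

train, test = data.split_frame(ratios=[0.8], seed=42)   # 80:20 train/test split

model = H2ODeepLearningEstimator(
    hidden=[50, 50],    # example architecture
    epochs=3,
    nfolds=5,           # 5-fold cross-validation
    seed=42,
)
model.train(x=features, y="label", training_frame=train)

# Training log-loss vs. cross-validation (XVAL) log-loss
print("train logloss:", model.logloss(train=True))
print("xval  logloss:", model.logloss(xval=True))
```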

I have two nested loops: the outer loop samples parameters such as epochs (1-5), number of layers (2-5), and number of nodes per layer (10-50).
For each parameter configuration in the outer loop I train 100 models, and for these I track, among other things, the average training and cross-validation log-loss, the variance of the log-loss, the average true-rate on training and test data, and the average profit on the test set.
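
A stripped-down version of the two loops, using the parameter ranges above (the `train_and_score` helper and the number of sampled configurations are placeholders; `features` and `train` are as in the earlier sketch):

```python
import random
import statistics

from h2o.estimators import H2ODeepLearningEstimator

def train_and_score(epochs, layers, nodes):
    """Train one cross-validated model; return (train log-loss, xval log-loss)."""
    model = H2ODeepLearningEstimator(hidden=[nodes] * layers,
                                     epochs=epochs, nfolds=5)
    model.train(x=features, y="label", training_frame=train)
    return model.logloss(train=True), model.logloss(xval=True)

results = []
for _ in range(20):                                        # outer loop: sample a configuration
    cfg = dict(epochs=random.randint(1, 5),
               layers=random.randint(2, 5),
               nodes=random.randint(10, 50))
    scores = [train_and_score(**cfg) for _ in range(100)]  # inner loop: 100 models per config
    train_ll = [s[0] for s in scores]
    xval_ll = [s[1] for s in scores]
    results.append({"cfg": cfg,
                    "mean_train_logloss": statistics.mean(train_ll),
                    "mean_xval_logloss": statistics.mean(xval_ll),
                    "var_xval_logloss": statistics.variance(xval_ll)})
```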

If I select the 10 best model configurations by average training log-loss or average training true-rate, I get a higher true-rate and profit on the test set than if I use the cross-validation metrics.
I think the cross-validation log-loss is a good loss for training (true-rate is probably not suited for that), but do I also have to use it for selecting model configurations?
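
The selection step is then just ranking the configurations by one metric or the other, continuing from the `results` list built in the sketch above:

```python
# Pick the 10 "best" configurations by either criterion (illustrative only).
top_by_train = sorted(results, key=lambda r: r["mean_train_logloss"])[:10]
top_by_xval = sorted(results, key=lambda r: r["mean_xval_logloss"])[:10]
```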

I previously did feature selection and then parameter tuning by following the cross-validation metrics, but since I would now get better results by following the training log-loss, I'm wondering whether I should redo the feature selection and parameter tuning.

Best Answer

There isn't a single "best" answer to your question. There are many threads throughout CV that discuss these issues. Here's one ...

Understanding how good a prediction is, in logistic regression

A key point to note is that training (calibration) data provides useful information regarding optimal fit (e.g., Harrell recommends using a nonparametric loess smoother for this) but is well known to be positively biased:

The calibration curve is both a measure of goodness-of-fit and a great way to check the accuracy of probabilities estimated by the model.
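
Here is a rough illustration of that kind of calibration check with a lowess smoother (synthetic data, just to show the idea; in practice `p` would be the model's predicted probabilities on held-out data and `y` the observed 0/1 outcomes):

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 2000)        # predicted probabilities
y = rng.binomial(1, p ** 1.2)      # outcomes from a slightly miscalibrated model

smoothed = lowess(y, p, frac=0.3)  # nonparametric calibration curve
plt.plot(smoothed[:, 0], smoothed[:, 1], label="lowess calibration")
plt.plot([0, 1], [0, 1], "--", label="perfect calibration")
plt.xlabel("predicted probability")
plt.ylabel("observed frequency")
plt.legend()
plt.show()
```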

Regardless, the emphasis in machine learning is on using validation (out-of-sample) data to evaluate fit. Validation results provide useful information regarding the bias-variance tradeoff that can't be inferred from calibration information.

  • If your validation error and training error are both high, you have underfitting (bias).

  • If your validation error is high but your training error is low, you have overfitting (variance); see the sketch after this list.
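
A toy version of this diagnostic using, say, your H2O model's training and cross-validation log-losses (the thresholds are arbitrary assumptions, purely for illustration):

```python
# Compare training and cross-validation error to get a rough read on
# bias vs. variance (thresholds are arbitrary, illustrative only).
train_err = model.logloss(train=True)   # model: a cross-validated H2O model
xval_err = model.logloss(xval=True)

if train_err > 1.0 and xval_err > 1.0:
    print("both errors high -> underfitting (high bias)")
elif xval_err > 1.5 * train_err:
    print("validation error much higher than training error -> overfitting (high variance)")
else:
    print("training and validation error comparable -> fit looks reasonable")
```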

Here's one discussion of this:

Question about bias-variance tradeoff