The key thing to remember is that, for cross-validation to give an (almost) unbiased performance estimate, every step involved in fitting the model must also be performed independently in each fold of the cross-validation procedure. The best thing to do is to view feature selection, setting the meta/hyper-parameters and optimising the parameters as integral parts of model fitting, and never do any one of these steps without the other two.
The optimistic bias that can be introduced by departing from that recipe can be surprisingly large, as demonstrated by Cawley and Talbot (2010), where the bias introduced by an apparently benign departure was larger than the difference in performance between competing classifiers. Worse still, biased protocols favour bad models most strongly, as bad models are more sensitive to the tuning of hyper-parameters and hence are more prone to over-fitting the model selection criterion!
Answers to specific questions:
The procedure in step 1 is valid because feature selection is performed separately in each fold, so what you are cross-validating is the whole procedure used to fit the final model. The cross-validation estimate will have a slight pessimistic bias, as the dataset in each fold is slightly smaller than the whole dataset used for the final model.
For 2, as cross-validation is used to select the model parameters, you need to repeat that procedure independently in each fold of the cross-validation used for performance estimation, so you end up with nested cross-validation (see the sketch after these answers).
For 3, essentially yes, you need nested nested cross-validation: in each fold of the outermost cross-validation (used for performance estimation) you need to repeat everything you intend to do to fit the final model.
For 4, yes, if you have a separate hold-out set, then that will give an unbiased estimate of performance without needing an additional cross-validation.
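To make the structure of nested cross-validation concrete, here is a minimal MATLAB sketch (not from the original analysis), reusing the random coin-flip setup of the simulation further down. The tuned hyper-parameter is taken to be the number of selected features, and the "classifier" is a majority vote over the selected features; both choices are illustrative assumptions rather than a recommended method. For point 3, the same pattern simply nests one level deeper.

% minimal sketch of nested cross-validation: the outer loop estimates
% performance, while the inner loop (run on the outer training set only)
% chooses the hyper-parameter k = number of selected features
NF = 56; NC = 259; NFOLD = 10;
ks = [1 3 5 7];                     % candidate values of k (assumed)
y = randn(NC,1) >= 0;               % random binary labels
x = randn(NC,NF) >= 0;              % random binary features
outer = mod(1:NC, NFOLD)' + 1;      % outer partition (performance estimation)
y_xval = zeros(size(y));
for i=1:NFOLD
   ytr = y(outer~=i); xtr = x(outer~=i,:);   % outer training set
   inner = mod(1:numel(ytr), NFOLD)' + 1;    % inner partition (model selection)
   ierr = zeros(size(ks));
   for k=1:numel(ks)
      yk = zeros(size(ytr));
      for j=1:NFOLD
         % feature selection using inner training data only
         e = mean(repmat(ytr(inner~=j),1,NF) ~= xtr(inner~=j,:));
         [~,idx] = sort(e);
         % majority vote over the best ks(k) features
         yk(inner==j) = mean(xtr(inner==j,idx(1:ks(k))),2) >= 0.5;
      end
      ierr(k) = mean(yk ~= ytr);
   end
   [~,best] = min(ierr);                     % choose k by inner cross-validation
   % refit on the whole outer training set using the chosen k
   e = mean(repmat(ytr,1,NF) ~= xtr);
   [~,idx] = sort(e);
   y_xval(outer==i) = mean(x(outer==i,idx(1:ks(best))),2) >= 0.5;
end
fprintf(1, 'nested CV estimate = %f\n', mean(y_xval ~= y));

As the data here are pure noise, the printed estimate should come out close to the true error rate of 0.5, precisely because nothing outside the outer training set ever influences the fitting.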
If you perform feature selection on all of the data and then cross-validate, then the test data in each fold of the cross-validation procedure were also used to choose the features, and this is what biases the performance analysis.
Consider this example. We generate some target data by flipping a coin 10 times and recording whether it comes down heads or tails. Next, we generate 20 features by flipping the coin 10 times for each feature and write down what we get. We then perform feature selection by picking the feature that matches the target data as closely as possible and use that feature as our prediction. If we then cross-validate, we will get an expected error rate lower than 0.5. This is because we have chosen the feature on the basis of a correlation over both the training set and the test set in every fold of the cross-validation procedure. However, the true error rate is going to be 0.5, as the target data are simply random. If you perform feature selection independently within each fold of the cross-validation, the expected value of the error rate is 0.5 (which is correct).
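As a quick sanity check (not part of the original example), the expected apparent error of the feature selected on all of the data can be estimated by simulation, and it comes out well below the true rate of 0.5; a cross-validation whose test data also informed the selection inherits much of that optimism:

% expected apparent error of the best-matching of 20 random features
% over 10 coin flips (the example above), estimated by simulation
nrep = 1e+4;
besterr = zeros(nrep,1);
for r=1:nrep
   t = rand(10,1) >= 0.5;    % target coin flips
   f = rand(10,20) >= 0.5;   % 20 candidate features
   besterr(r) = min(mean(repmat(t,1,20) ~= f));
end
fprintf(1, 'expected apparent error of selected feature = %f\n', mean(besterr));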
The key idea is that cross-validation is a way of estimating the generalisation performance of a process for building a model, so you need to repeat the whole process in each fold. Otherwise, you will end up with a biased estimate, or an under-estimate of the variance of the estimate (or both).
HTH
Here is some MATLAB code that performs a Monte-Carlo simulation of this setup, with 56 features and 259 cases to match your example. The output it gives is:
Biased estimator: erate = 0.429210 (0.397683 - 0.451737)
Unbiased estimator: erate = 0.499689 (0.397683 - 0.590734)
The biased estimator is the one where feature selection is performed prior to cross-validation; the unbiased estimator is the one where feature selection is performed independently in each fold of the cross-validation. This suggests that the bias can be quite severe in this case, depending on the nature of the learning task.
NF = 56;      % number of candidate features
NC = 259;     % number of cases
NFOLD = 10;   % number of cross-validation folds
NMC = 1e+4;   % number of Monte-Carlo replications

% perform Monte-Carlo simulation of the biased estimator
% (feature selection on all of the data, prior to cross-validation)
erate = zeros(NMC,1);
for i=1:NMC
   % random labels and features, so the true error rate is 0.5
   y = randn(NC,1) >= 0;
   x = randn(NC,NF) >= 0;
   % perform feature selection using ALL of the data
   err = mean(repmat(y,1,NF) ~= x);
   [err,idx] = min(err);
   % perform cross-validation (the selected feature itself is the prediction)
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate);   % monitor progress
   drawnow;
end
erate = sort(erate);
fprintf(1, ' Biased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));

% perform Monte-Carlo simulation of the unbiased estimator
% (feature selection repeated independently inside each fold)
erate = zeros(NMC,1);
for i=1:NMC
   y = randn(NC,1) >= 0;
   x = randn(NC,NF) >= 0;
   % perform cross-validation
   partition = mod(1:NC, NFOLD)+1;
   y_xval = zeros(size(y));
   for j=1:NFOLD
      % perform feature selection using only the training data for this fold
      err = mean(repmat(y(partition~=j),1,NF) ~= x(partition~=j,:));
      [err,idx] = min(err);
      y_xval(partition==j) = x(partition==j,idx(1));
   end
   erate(i) = mean(y_xval ~= y);
   plot(erate);   % monitor progress
   drawnow;
end
erate = sort(erate);
fprintf(1, 'Unbiased estimator: erate = %f (%f - %f)\n', mean(erate), erate(ceil(0.025*end)), erate(floor(0.975*end)));
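The figures in parentheses printed by each fprintf are the 2.5th and 97.5th percentiles of the sorted Monte-Carlo distribution of the estimate, so they show the run-to-run variability of each estimator as well as its mean. Note from the output above that the unbiased estimator is centred on the true error rate of 0.5 but has a visibly wider spread, in line with the earlier remark about under-estimating the variance of the estimate.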
Best Answer
There isn't a single "best" answer to your question. There are many threads throughout CV that discuss these issues. Here's one ...
Understanding how good a prediction is, in logistic regression
Key points to note are that training or calibration data provide useful information regarding the quality of fit (e.g., Harrell recommends using a nonparametric loess smoother to assess calibration), but in-sample estimates of performance are well known to be positively (optimistically) biased.
Regardless, the emphasis in machine learning is on using validation (out-of-sample) data to evaluate fit. Validation results provide useful information regarding the bias-variance tradeoff that can't be inferred from calibration information.
If your validation error and training error are both high, you have underfitting (bias).
If your validation error is high but training error is low, you have overfitting (variance).
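To make that diagnostic concrete, here is a minimal base-MATLAB sketch (an illustrative assumption, not taken from the linked threads) comparing training and validation error as model complexity grows, using polynomial fits to noisy data as a stand-in for any model family:

% compare training and validation error as complexity (polynomial degree) grows
n = 100;
xd = linspace(0,1,n)';
yd = sin(2*pi*xd) + 0.3*randn(n,1);   % noisy targets
itr = 1:2:n; ival = 2:2:n;            % simple alternating train/validation split
for d = [1 3 9 15]
   [w,~,mu] = polyfit(xd(itr), yd(itr), d);    % centred and scaled fit
   etr = mean((polyval(w, xd(itr), [], mu) - yd(itr)).^2);
   eva = mean((polyval(w, xd(ival), [], mu) - yd(ival)).^2);
   fprintf(1, 'degree %2d: train MSE = %f, validation MSE = %f\n', d, etr, eva);
end

Low degrees leave both errors high (underfitting); high degrees drive the training error down while the validation error climbs (overfitting).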
Here's one discussion of this:
Question about bias-variance tradeoff