Solved – Performance in training set worse than in test set

Tags: glmnet, machine learning, supervised learning

I have a high-dimensional regression problem that I solve with glmnet, using a nested CV scheme. In the inner CV loop (10×5-fold) a grid search is done to find the optimal hyperparameters for glmnet. The outer CV loop (10×5-fold) is used to estimate the performance of the regression with the previously found optimal parameters. However, when I apply this scheme something strange happens: the RMSE in the outer loop is much better than in the inner loop. This is counterintuitive, because one would expect the model to do well in the inner loop, where it was trained directly on these data, and worse in the outer loop, where it has "never seen" the data beforehand. Does anyone have an explanation? My data set has 172 instances in total and 474 variables.
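For reference, in scikit-learn terms the scheme looks roughly like the sketch below (ElasticNet stands in for glmnet: glmnet's alpha corresponds to l1_ratio here and glmnet's lambda to alpha); the synthetic data and the grid are placeholders, not my actual setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score

# Placeholder data with the same shape as described above.
X, y = make_regression(n_samples=172, n_features=474, noise=10.0, random_state=0)

# Illustrative grid: sklearn's l1_ratio plays the role of glmnet's alpha,
# sklearn's alpha the role of glmnet's lambda.
param_grid = {"alpha": np.logspace(-3, 1, 10), "l1_ratio": [0.1, 0.5, 0.9, 1.0]}

inner_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)  # inner 10x5-fold
outer_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=2)  # outer 10x5-fold

# Inner loop: grid search picks the hyperparameters on each outer training set.
tuned = GridSearchCV(ElasticNet(max_iter=10000), param_grid,
                     scoring="neg_root_mean_squared_error", cv=inner_cv)

# Outer loop: estimates the performance of the whole tune-then-fit procedure.
outer_rmse = -cross_val_score(tuned, X, y, cv=outer_cv,
                              scoring="neg_root_mean_squared_error")
print(f"outer-loop RMSE: {outer_rmse.mean():.2f} +/- {outer_rmse.std():.2f}")
```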
Related Solutions
Nested cross validation explained without nesting
Here's how I see (nested) cross validation and model building. Note that I'm a chemist and, like you, look at the model building process from the application side (see below). My main point here is that, from my point of view, I don't need a dedicated nested variety of cross validation. I need a validation method (e.g. cross validation) and a model training function:
model = f(training data)

"My" model training function f does not need any hyperparameters, because it internally does all hyperparameter tuning (e.g. your alpha, lambda and threshold).
In other words, my training function may contain any number of inner cross validations (or out-of-bag or whatever performance estimate I may deem useful). However, note that the distinction between parameters and hyperparameters typically is that the hyperparameters need to be tuned to the data set/application at hand, whereas the parameters can then be fitted regardless of what data it is. Thus, from the point of view of the developer of a new classification algorithm, it does make sense to provide only the "naked" fitting function g(training data, hyperparameters) that fits the parameters given data and hyperparameters.
The point of having the "outer" training function f is that, after you did your cross validation run, it gives you a straightforward way to train "on the whole data set": just use f(whole data set) instead of the call f(cv split training data) for the cross validation surrogate models.

Thus in your example, you'll have 5 + 1 calls to f, and each of the calls to f will have e.g. 100 × 5 calls to g.
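As a rough sketch of this division of labour (scikit-learn, the Random Forest and the grid below are purely my illustrative assumptions, not part of the recipe), f and g could look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold

param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}  # illustrative grid

def g(X, y, hyperparameters):
    """The "naked" fitting function: fit parameters for given data and hyperparameters."""
    return RandomForestRegressor(random_state=0, **hyperparameters).fit(X, y)

def f(X, y):
    """The ready-to-use training function: all hyperparameter tuning happens inside."""
    inner = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5)
    inner.fit(X, y)                      # the inner cross validation lives in here
    return g(X, y, inner.best_params_)   # refit on all data handed to f

def validate_f(X, y, n_splits=5):
    """Outer cross validation of f: one call to f per split, i.e. n_splits surrogate models.
    X and y are numpy arrays."""
    rmses = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=1).split(X):
        model = f(X[train_idx], y[train_idx])
        rmse = mean_squared_error(y[test_idx], model.predict(X[test_idx])) ** 0.5
        rmses.append(rmse)
    return np.mean(rmses)

# 5 + 1 calls to f: five surrogate models for validation, plus the deployed model.
# rmse_estimate = validate_f(X, y)
# final_model = f(X, y)          # "train on the whole data set"
```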
Probability threshold

While you could do this with yet another cross validation, this is not necessary: it is just one more hyperparameter your ready-to-use model has, and it can be estimated inside f.
What you need to fix it is a heuristic that allows you to calculate such a threshold. There is a wide variety of heuristics (from ROC plus specifying how important it is to avoid false positives compared to false negatives, over a minimum acceptable sensitivity or specificity or PPV or NPV, to allowing two thresholds and thus an "uncertain" (NA) level, and so on) that are suitable in different situations; good heuristics are usually very application specific.
But for the question here, you can do this inside f, e.g. by using the predictions obtained during the inner cross validation to calculate the ROC and then choosing your working point/threshold accordingly.
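Purely as an illustration (logistic regression, Youden's J and the scikit-learn calls below are my assumptions; your heuristic will most likely be different), the threshold tuning inside f could look roughly like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict

def tune_threshold(X, y, estimator=None, cv=5):
    """Pick a probability threshold from out-of-fold predictions of an inner CV."""
    estimator = estimator if estimator is not None else LogisticRegression(max_iter=1000)
    # Out-of-fold probabilities, so the threshold is not chosen on predictions
    # the model has already seen during fitting.
    proba = cross_val_predict(estimator, X, y, cv=cv, method="predict_proba")[:, 1]
    fpr, tpr, thresholds = roc_curve(y, proba)
    # One possible heuristic: Youden's J, i.e. maximise sensitivity + specificity - 1.
    return thresholds[np.argmax(tpr - fpr)]
```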
Specific Comments to parts of the question
I understand that I shouldn't report the performance from the CV used to pick the optimal hyperparameters as an estimate of the expected performance of my final model (which would be overly-optimistic) but should instead include an outer CV loop to get this estimate.
Yes. (Though the inner estimate does carry information in relation to the outer estimate: if it is much more optimistic than the outer estimate, you are typically overfitting.)
I understand that the inner CV loop is used for model selection (in this case, the optimal hyperparameters) and that the outer loop is used for model evaluation, i.e., the inner and outer CV serve two different purposes that often are erroneously conflated.

Yes. And "model selection" here means any kind of data-driven model tuning, really; that includes tuning your cutoff threshold.
That is, the hyperparameter tuning is part of "the method for building the model".
I prefer to see it this way as well: I'm a chemist and, like you, look at it from the application side: for me a trained/fitted model is not complete without the hyperparameters, or, more precisely, a model is something I can use directly to obtain predictions. Though, as you note, other people have a different view (without hyperparameter tuning). In my experience, this is often the case with people developing new models: hyperparameter tuning is then a "solved problem" and not considered. (Side note: their view on what cross validation can do in terms of validation is also slightly different from what cross validation can do from the application side.)
I try to answer my own question.
Let's start with the selection of the model:
- Should I use an SVM, Neural Network (NN) or Random Forest (RF)?
To answer this question, the following should be done:
- For each of the three different classifiers, define a grid of the hyperparameters that should be analyzed
- For each of the three classifiers, do a separate nested cross validation during which the hyperparameters are tuned
- The selection of hyperparameters in the inner loop will vary from fold to fold. Therefore this is not a way to select any hyperparameters
- Instead, use only the average score of the outer loops to get an unbiased estimate of the model under consideration (whereby model means here SVM, NN or RF)
- For each model (again: SVM, NN or RF) one of these values is retrieved from the nested CV. This value is then an unbiased metric to tell which model works best on the task. Select this model.
Now a model is selected and I can tell whether I want to choose an SVM, NN or RF. Notice that the choice of hyperparameters has not been made until now! Let's assume from now on that the Random Forest yielded the highest score, so that we want to use it as our model.
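For illustration, a sketch of this model-selection step with scikit-learn (the estimators, the grids and the synthetic data are placeholders I made up, not a recommendation) could look like:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)  # placeholder data

candidates = {
    "SVM": (make_pipeline(StandardScaler(), SVC()),
            {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}),
    "NN":  (make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000)),
            {"mlpclassifier__hidden_layer_sizes": [(10,), (50,)]}),
    "RF":  (RandomForestClassifier(random_state=0),
            {"max_depth": [3, None], "min_samples_leaf": [1, 5]}),
}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)

for name, (estimator, grid) in candidates.items():
    tuned = GridSearchCV(estimator, grid, cv=inner_cv)        # inner loop: hyperparameter tuning
    outer_scores = cross_val_score(tuned, X, y, cv=outer_cv)  # outer loop: model evaluation
    print(f"{name}: mean outer accuracy = {outer_scores.mean():.3f}")

# Pick the model family with the best mean outer score; the per-fold hyperparameters
# found along the way are discarded.
```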
- Which hyperparameters should I use for my model (i.e. the Random Forest)?
To answer this, do the following:
- Use the grid of hyperparameters for the Random Forest defined before.
- Loop over the hyperparameters (or combinations of hyperparameters), i.e. for params in hyperparameters:
  - Do k-fold cross validation (single, not nested!) during which we train the RF with params
  - Measure the mean score, i.e. the average score over all k folds
- The hyperparameters with the highest mean score are the ones which should be chosen for the final model
Now the hyperparameters of the Random Forest are determined.
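Again only as a sketch (the grid and data below are placeholders), this second step is a plain, non-nested grid-search CV on all the data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_classification(n_samples=300, n_features=20, random_state=0)  # stands in for all available data

grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      cv=KFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X, y)                        # single, non-nested CV on all the data

best_params = search.best_params_       # hyperparameters for the final model
single_cv_score = search.best_score_    # the "second" score value discussed below
final_model = search.best_estimator_    # already refit on all the data with best_params
```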
We have two score values for the Random Forest: (i) the one from the nested cross validation where (possibly and almost certainly) different hyperparameters were considered in each inner fold and (ii) the score of the single cross validation during which we found our best hyperparameters for the RF.
The first one only told us which model to choose whereas the second one tells us how well the model including fixed hyperparameters performs.
A final model used for prediction of new unseen data would then be the Random Forest with the selected hyperparameters trained on all data.
There is just one small thing left open: I would think that the second score value (the one from the single CV) is a proper measure of how well our final model (including hyperparameters) works on new data. Is this correct? And could one say that, in general, the nested CV score of this model is either a) higher, b) lower or c) equal to the non-nested score? A comment on this would be highly appreciated.
Best Answer
You probably have the learning curve working to your advantage, i.e. you can train better models given more data. Even though you are optimizing hyperparameters, the best performance in the inner CV can be consistently worse, because after the inner CV is done the model you evaluate in the outer loop is retrained on the full outer training set, i.e. it gets back the 20% of data points held out in each inner fold. This depends on whether or not you have reached the number of data instances required to make the learning curve stagnate. You appear to be in a region in which it is still steeply increasing.
You will probably get a smaller difference if you use 5×10-fold CV in the inner procedure.

It should be noted that some parameters you obtain in this way might be suboptimal. Regularization parameters, for instance, are affected by sample size. If your data set is small (and yours is), you may misestimate the required regularizers, because a 20% difference in sample size can be considerable. This can be remedied by using more CV folds in the inner procedure (i.e. increase $k$ in $k$-fold CV).
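One way to check this explanation on your own data is to look at the learning curve directly; the sketch below (ElasticNet as a stand-in for glmnet, synthetic data as a placeholder) only illustrates the idea:

```python
# With 10x5-fold nesting of 172 instances, the inner-CV models train on roughly
# 110 instances (64%) while the models evaluated in the outer loop train on roughly
# 138 (80%); a learning curve that is still falling between those sizes explains the gap.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, learning_curve

X, y = make_regression(n_samples=172, n_features=474, noise=10.0, random_state=0)  # placeholder

sizes, _, test_scores = learning_curve(
    ElasticNet(alpha=0.1, max_iter=10000), X, y,
    train_sizes=np.linspace(0.4, 1.0, 6),
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error")

for n, rmse in zip(sizes, -test_scores.mean(axis=1)):
    print(f"{n:3d} training instances: CV RMSE = {rmse:.2f}")
```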