Solved – Clarification on how to perform nested cross-validation

Tags: cross-validation, generalized-additive-model, method-comparison, mortality

I'm currently trying to build life tables for our population based on death and population counts. My first idea was to follow the methodology of this paper, but after some discussions we are planning to try different models, such as GLMMs and GAMs, and to explore the inclusion of different knots and families.

The author suggested using train and test sets or cross-validation. I've been reading (great resources and questions on this site!) and I would like to know whether the methodology I have in mind is appropriate, as I'm not entirely sure.

Here's what I have in mind:

  1. Shuffle the data
  2. Divide the data into a training set (80%) and a test set (20%).
  3. Test the different parameters in the training data, and compare them using the test data.
  4. Once I have the chosen family and knots, shuffle the data again.
  5. Run n repetitions (20?) of 5-fold cross-validation (the dataset has 3,440 observations) on that single model to estimate the performance of the methodology (the idea being that the model chosen in step 4 is the one best able to predict the data). A minimal sketch of what I mean by this step is shown below.
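For step 5, this is roughly what I have in mind (a sketch only; the formula, smooth terms and column names are placeholders rather than my final model):

library(mgcv)

set.seed(1)
n_reps  <- 20
n_folds <- 5
rep_rmse <- numeric(n_reps)

for (r in seq_len(n_reps)) {
  # reshuffle the fold assignment on every repetition
  fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(data)))
  fold_rmse <- numeric(n_folds)

  for (k in seq_len(n_folds)) {
    train <- data[fold_id != k, ]
    test  <- data[fold_id == k, ]

    # placeholder model: the chosen family/knots would go here
    fit <- gam(deads ~ s(age, bs = "cr") + gender + calendar_year +
                 offset(log(pop)),
               family = poisson, data = train, method = "REML")

    pred <- predict(fit, newdata = test, type = "response")
    fold_rmse[k] <- sqrt(mean((test$deads - pred)^2))
  }
  rep_rmse[r] <- mean(fold_rmse)
}

mean(rep_rmse)  # average out-of-fold error over the 20 repetitions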

[Figure: diagram of the proposed nested cross-validation scheme, with the model parameters tested in the inner loop]

Now, based on the figure I've uploaded, here are my questions and points of confusion:

Should I use nested cross-validation and test the model parameters in the inner loop (as in the figure)? Does each inner fold correspond to one set of parameters, or do I need to repeat the nested CV for each set of parameters? The theoretical number of knots may be too low, or I may need to add a smooth for one of the variables, etc. I'm not sure whether I can play with these parameters in the inner loop(s).

This brings me to my other question: do I need to do one k-fold cross-validation per method? If I use cross-validation to compare a GAM with a Poisson family and a GAM with a negative binomial family, one model may predict the data well on one fold but poorly on another, for instance because the set of knots is inappropriate there.

Another thing I'm unsure of is that one of my variables is calendar year. Do I need to take special care when shuffling the data, perhaps making sure each fold contains information from all the years, or something else? I also thought of using a GLMM or GAMM; perhaps that would be enough? A small sketch of what I mean by year-balanced folds is below.
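For instance, one way to make sure every fold contains observations from every calendar year would be to assign fold labels within each year (a sketch; calendar_year is the column name used in my models):

set.seed(1)
n_folds <- 5

# assign fold labels separately within each calendar year,
# so every fold contains observations from every year
fold_id <- integer(nrow(data))
for (yr in unique(data$calendar_year)) {
  idx <- which(data$calendar_year == yr)
  fold_id[idx] <- sample(rep(seq_len(n_folds), length.out = length(idx)))
}

table(data$calendar_year, fold_id)  # check the year-by-fold balance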

My other question relates to the outputs and how to evaluate the cross-validation. Since my aim is prediction, I was planning to look at the residuals of each model, the AIC, and a plot of predicted against observed values to see which method performed better. Should I use different metrics? (A sketch of what I mean is below.)
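Concretely, for each held-out fold I was thinking of something along these lines (a sketch; fit and test come from the cross-validation loop above):

# out-of-fold predictions on the response scale
pred <- predict(fit, newdata = test, type = "response")

# simple out-of-sample summaries
rmse <- sqrt(mean((test$deads - pred)^2))
mae  <- mean(abs(test$deads - pred))

# AIC is an in-sample criterion, computed on the training fit
aic <- AIC(fit)

# predicted vs observed for the held-out fold
plot(test$deads, pred,
     xlab = "Observed deaths", ylab = "Predicted deaths")
abline(0, 1, lty = 2)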

I appreciate any help or guidance as I just started reading about cross validation and have no experience.

EDIT – following the answers from @adrin and @Gavin Simpson

If I understood your answers correctly, nested cross-validation would basically be something along these lines:

for each fold in the outer loop
     for each parameter set in the grid
          fit that parameter set in the inner loop
          evaluate it on each inner k-fold
     choose the best parameter set based on the inner-loop results
     refit the best parameter set on the outer training data and evaluate it on the outer test fold
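In R, I picture that pseudocode roughly like this (a sketch only: the two candidate model specifications and the error metric are placeholders for whatever ends up in the grid):

library(mgcv)

# hypothetical "grid": here just two candidate specifications
candidates <- list(
  poisson_cr = function(d) gam(deads ~ s(age, bs = "cr") + gender +
                                 calendar_year + offset(log(pop)),
                               family = poisson, data = d, method = "REML"),
  nb_tp      = function(d) gam(deads ~ s(age, bs = "tp") + gender +
                                 calendar_year + offset(log(pop)),
                               family = nb(), data = d, method = "REML")
)

set.seed(1)
outer_id <- sample(rep(1:5, length.out = nrow(data)))
outer_score <- numeric(5)

for (i in 1:5) {
  outer_train <- data[outer_id != i, ]
  outer_test  <- data[outer_id == i, ]

  # inner loop: score every candidate on 4 inner folds of the outer training data
  inner_id <- sample(rep(1:4, length.out = nrow(outer_train)))
  inner_score <- sapply(candidates, function(fit_fun) {
    mean(sapply(1:4, function(j) {
      fit  <- fit_fun(outer_train[inner_id != j, ])
      pred <- predict(fit, newdata = outer_train[inner_id == j, ],
                      type = "response")
      sqrt(mean((outer_train$deads[inner_id == j] - pred)^2))
    }))
  })

  # refit the winner on all outer training data, score it on the outer test fold
  best <- candidates[[which.min(inner_score)]]
  fit  <- best(outer_train)
  pred <- predict(fit, newdata = outer_test, type = "response")
  outer_score[i] <- sqrt(mean((outer_test$deads - pred)^2))
}

mean(outer_score)  # nested-CV estimate of the whole selection procedure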

The parameters I would like to test are:

  1. Different sets of knots for age (some fixed at the boundaries, but I would like to test a few alternatives for the middle ages).

The reason for choosing the knots is the extreme values at the boundaries: mortality is high at the first point (infant mortality) and above 85 (since data above 85 are grouped).

I just got Simon Wood's book so I can learn more about GAMs. I thought about using cubic regression splines, but thin plate splines could also be interesting: it would be worth looking at the difference between thin plate splines and cubic splines with predefined knots, and if thin plate splines fit the data well they would be a great option (a rough sketch of the two variants is at the end of this edit).

  2. A GLM with a Poisson and a negative binomial family, and a GAM with a Poisson and a negative binomial family.
  3. I would like to check the inclusion of a random effect for region.
  4. I would also like to test whether an interaction between gender and age, and between region and age, improves the model.

I wasn't thinking of using ti() for the interactions, but rather something along these lines:

gam(deads ~ region*age + gender*age + calendar_year + ti(age, k = 7, bs = "cr") +
      offset(log(pop)),
    knots = list(age = list_knots), family = poisson, data = data, method = "REML")

But I also thought of trying something like this:

gam(deads ~ s(age, bs = "fs", by = region) + s(age, bs = "fs", by = gender) +
      calendar_year + ti(age, k = 7, bs = "cr") + offset(log(pop)),
    knots = list(age = list_knots), family = poisson, data = data, method = "REML")

Obviously, I need to read more to check what makes sense and is more appropriate.

  5. Finally, I was wondering whether I should add some kind of factor smooth for the factor variables.

Perhaps I should have a clearer idea of the final model first, but having all of these parameters to test is a large part of my confusion about cross-validation in the inner loop(s). I'd be happy to hear your thoughts.
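To make item 1 concrete, these are roughly the two variants I would compare in the inner loop (a sketch only; the knot positions and the value of k are placeholders, not settled choices):

library(mgcv)

# hypothetical knot positions: boundaries fixed, middle-age knots to be varied
list_knots <- c(0, 1, 15, 45, 65, 85, 99)

# variant A: cubic regression spline with predefined knots
m_cr <- gam(deads ~ s(age, bs = "cr", k = length(list_knots)) + gender +
              calendar_year + offset(log(pop)),
            knots = list(age = list_knots),
            family = poisson, data = data, method = "REML")

# variant B: thin plate regression spline, letting mgcv place the basis
m_tp <- gam(deads ~ s(age, bs = "tp", k = 10) + gender +
              calendar_year + offset(log(pop)),
            family = poisson, data = data, method = "REML")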

Best Answer

I gave a general answer to this question, and here is what applies to your question:

Train and Validation Split:

First split the input into train and validation sets, but I'd also take domain knowledge into account. In your case, I would use that year variable and take the last few years of the data (I'm not sure how many years you have; let's say 2 out of 10, if you have 10) as your validation set. A sketch of such a split is below.
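In R terms (matching the mgcv models in the question), that kind of split might look like this sketch (the cutoff of 2 years and the column name are assumptions):

# hold out the most recent years as the validation set (here: the last 2)
cutoff <- max(data$calendar_year) - 2

train_data      <- data[data$calendar_year <= cutoff, ]
validation_data <- data[data$calendar_year >  cutoff, ]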

Nested Cross Validation and Parameter Search:

Now you can do what you explain in your diagram. Assume you have a method which takes the input data and the parameters (e.g. a parameter defining whether to use a GAM with a Poisson family or a GAM with a negative binomial family), and fits the corresponding model on the data. Let's call the set of all the parameter combinations you're considering a parameter grid.

Now, for each of the outer folds, you do a whole grid search using the inner folds to get a score for each parameter set. Then you train your model with the best parameter set on all the data that went into the inner loop (i.e. the training portion of that outer fold), and get its performance on the test portion of the outer fold.

Assume your parameter grid has 3 values in total (e.g. a GAM with a Poisson family, a GAM with a negative binomial family, and an unregularized linear model), and there's no other parameter involved. Then you'd do this many model fits:

$5\,[\text{outer loop}] \times \left(3\,[\text{parameter grid}] \times 4\,[\text{inner loop}] + 1\,[\text{best parameters}]\right) = 65$

In code, here's how it would look:

from sklearn.model_selection import GridSearchCV
from statistics import mean

# placeholder grid: each entry is a modelling choice you want to compare
parameter_grid = {'param1': ['binomial', 'poisson'],
                  'smoothing': ['yes', 'no']}

scores = []
for train, test in outer_folds:                      # outer loop: performance estimation
    model = GridSearchCV(estimator=my_custom_model,  # inner loop: grid search
                         param_grid=parameter_grid,
                         refit=True,                 # refit best params on all inner data
                         cv=4)
    model.fit(train.X, train.y)
    scores.append(model.score(test.X, test.y))       # score on the outer test fold

score = mean(scores)

For simplicity I'm diverging from the actual API, so the above code is more like pseudocode, but it gives you the idea.

This gives you an idea of how your parameter grid would perform on your data. You may then decide to add a regularization parameter to your linear model, exclude one of your GAMs, etc. You do all of this manipulation of your parameter grid at this stage.

Final Evaluation:

Once you've settled on a parameter grid you're comfortable with, you then apply it to your whole training data. You can do a grid search on the whole training data with an ordinary 5-fold cross-validation, WITHOUT manipulating your parameter grid, to find the best parameter set, train a model with those parameters on the whole training data, and then get its performance on your validation set. That result is your final performance, and if you want your results to be as valid as they can be, you should not go back and optimize any parameters at this point.

To clarify the parameter search at this stage, I'm borrowing the scikit-learn API in Python:

parameter_grid = {'param1': ['binomial', 'poisson'],
                  'smoothing': ['yes', 'no']}

model = GridSearchCV(estimator=my_custom_model,
                     param_grid=parameter_grid,
                     refit=True,
                     cv=5)
model.fit(X_train, y_train)
model.score(X_validation, y_validation)

The above code runs a 5-fold cross-validated grid search on your training data (model.fit(...) with cv=5), fits the best model on the whole training data (refit=True), and finally gives you the score on the validation set (model.score(...)).

Deciding what to put in your parameter_grid at this stage is what you do in the previous stage. You can include or exclude all the parameters you mention, and experiment and evaluate there. Once you're certain about your choice of parameter grid, you move on to this validation stage.