Short answer: it is neither wrong nor new.
We discussed this validation scheme under the name "set validation" about 15 years ago when preparing a paper*, but in the end never actually used the term, as we didn't find it used in practice.
Wikipedia refers to the same validation scheme as repeated random sub-sampling validation or Monte Carlo cross validation.
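As a minimal sketch of what that scheme looks like in practice (my own illustration, assuming a feature matrix `X` and labels `y` as NumPy arrays, scikit-learn's `ShuffleSplit`, and a placeholder classifier):

    import numpy as np
    from sklearn.linear_model import LogisticRegression   # placeholder surrogate model
    from sklearn.model_selection import ShuffleSplit

    # N independent random splits (repeated random sub-sampling / Monte Carlo CV);
    # unlike k-fold, the same case may land in the test set in several iterations
    splitter = ShuffleSplit(n_splits=100, test_size=0.25, random_state=0)

    scores = []
    for train_idx, test_idx in splitter.split(X, y):
        surrogate = LogisticRegression().fit(X[train_idx], y[train_idx])
        scores.append(surrogate.score(X[test_idx], y[test_idx]))

    # the averaged score is then read as an estimate for the model built on the whole data set
    estimate = np.mean(scores)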
From a theory point of view, the concept was of interest to us because

- it is another interpretation of the same numbers usually referred to as hold-out; only the model the estimate is used for differs: a hold-out estimate is taken as the performance estimate for exactly the model that was tested, whereas set (or Monte Carlo) validation treats the tested model(s) as surrogate model(s) and interprets the very same numbers as a performance estimate for a model built on the whole data set, as is usually done with cross validation or out-of-bootstrap validation estimates;
- and it sits somewhere in between
  - the more common cross validation techniques (resampling without replacement, interpretation as an estimate for the whole-data model),
  - hold-out (see above: same calculation and numbers, but typically without the N iterations/repetitions and with a different interpretation),
  - and out-of-bootstrap (the N iterations/repetitions are typical for out-of-bootstrap, but I've never seen them applied to hold-out, and they are [unfortunately] rarely done with cross validation).
* Beleites, C.; Baumgartner, R.; Bowman, C.; Somorjai, R.; Steiner, G.; Salzer, R. & Sowa, M. G.: Variance reduction in estimating classification error using sparse datasets. Chemom Intell Lab Syst, 79, 91–100 (2005).
The "set validation" error for N = 1 is hidden in fig. 6 (i.e. its bias and variance can be reconstructed from the data given, but they are not reported explicitly).
> but it seems not optimal in terms of variance. Are there arguments in favor or against the second procedure?
Well, in the paper above we found the total error (bias² + variance) of out-of-bootstrap and repeated/iterated $k$-fold cross validation to be pretty similar (with oob having somewhat lower variance but higher bias; we did not follow up to check whether/how much of this trade-off is due to resampling with/without replacement and how much is due to the different split ratio of about 1:2 for oob).
Keep in mind, though, that I'm talking about accuracy in small sample size situations, where the dominating contribution to the variance uncertainty is the limited number of true samples available for testing, and that is the same for oob, cross validation, or set validation. Iterations/repetitions allow you to reduce the variance caused by instability of the (surrogate) models, but not the variance uncertainty due to the limited total sample size.
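To put a rough number on that last point (a back-of-the-envelope sketch of my own, not taken from the paper): the testing contribution to the uncertainty of an observed accuracy is roughly the binomial standard error, which depends only on the number of independent test cases, no matter how many iterations/repetitions you run.

    import numpy as np

    # standard error of an observed accuracy for a true accuracy p
    # tested on n_test independent cases
    def binomial_se(p, n_test):
        return np.sqrt(p * (1 - p) / n_test)

    # e.g. true accuracy 0.8 tested on 25 cases: about +/- 0.08,
    # and iterating/repeating the splits cannot shrink this contribution
    print(binomial_se(0.80, 25))   # 0.08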
Thus, assuming that you perform an adequately large number of iterations/repetitions N, I'd not expect practically relevant differences in the performance of these validation schemes.
One validation scheme may fit better with the scenario you try to simulate by the resampling, though.
I gave a general answer to this question, and here is what applies to your question:
Train and Validation Split:
First, split the input into train and validation sets, but also take domain knowledge into account. In your case, I would use the year variable and take the last few years of the data (I'm not sure how many years you have; let's say 2 out of 10, if you have 10) as your validation set.
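A rough sketch of such a split (assuming the data sits in a pandas DataFrame with a year column; all names here are illustrative):

    # `df`: a pandas DataFrame with one row per observation and a 'year' column (assumed), e.g. 2010-2019
    cutoff = df['year'].max() - 1                 # keep the last 2 years back

    train_df = df[df['year'] < cutoff]            # used for the (nested) cross validation below
    validation_df = df[df['year'] >= cutoff]      # held back for the final evaluation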
Nested Cross Validation and Parameter Search:
Now you can do what you explain in your diagram. Assume you have a method which takes the input data and the parameters (e.g. a parameter defining whether to use a GAM with a Poisson family or a GAM with a negative binomial family), and fits the corresponding model on the data. Let's call the set of all these parameters you're considering a parameter grid.
Now, for each of those outer folds, you do a whole grid search using the inner folds to get a score for each parameter set. Then you train your model with that best parameter set on the whole data given to the inner loop (i.e. the training portion of the outer fold), and get its performance on the test portion of the outer fold.
Assume your parameter grid has 3 values in total (e.g. a GAM with a Poisson family, a GAM with a negative binomial family, and an [unregularized] linear model), and there's no other parameter involved. Then you'd do this many trainings:
$5 [\text{outer loop}] \times \left(3[\text{parameter grid}] \times 4[\text{inner loop}] + 1[\text{best parameters}]\right) = 65$
Talking in code, here's roughly what it would look like:
    from statistics import mean
    from sklearn.model_selection import GridSearchCV

    parameter_grid = {'param1': ['binomial', 'poisson'],
                      'smoothing': ['yes', 'no']}

    scores = []
    for train, test in outer_folds:                    # outer loop
        # inner loop: 4-fold grid search over the parameter grid, then refit
        # the best parameter set on the whole outer-loop training portion
        model = GridSearchCV(estimator=my_custom_model,
                             param_grid=parameter_grid,
                             refit=True,
                             cv=4)
        model.fit(train.X, train.y)
        scores.append(model.score(test.X, test.y))     # score on the outer test portion
    score = mean(scores)
For simplicity, I'm diverging from the actual API, so the above code is more like pseudocode, but it gives you the idea.
This gives you an idea about how your parameter grid would perform on your data. Then you may think you'd like to add a regularization parameter to your linear model, or exclude one of your GAMs, etc. You do all the manipulations on your parameter set at this stage.
Final Evaluation:
Once you're done finding a parameter grid you're comfortable with, you then apply it to your whole train data. You can do a grid search on your whole train data with an ordinary 5-fold cross validation, WITHOUT manipulating your parameter grid, to find the best parameter set, train a model with those parameters on your whole train data, and then get its performance on your validation set. That result is your final performance, and if you want your results to be as valid as they can be, you should not go back and optimize any parameters at this point.
To clarify the parameter search at this stage, I'm getting help from the scikit-learn API in Python:
    from sklearn.model_selection import GridSearchCV

    parameter_grid = {'param1': ['binomial', 'poisson'],
                      'smoothing': ['yes', 'no']}

    # 5-fold grid search on the whole training data, refitting the best
    # parameter set on all of it before scoring on the held-out validation set
    model = GridSearchCV(estimator=my_custom_model,
                         param_grid=parameter_grid,
                         refit=True,
                         cv=5)
    model.fit(X_train, y_train)
    model.score(X_validation, y_validation)
The above code does (`model.fit(...)`) a 5-fold cross validation (`cv=5`) on your training data, fits the best model on the whole training data (`refit=True`), and finally gives you the score on the validation set (`model.score(...)`).
Deciding what to put in your `parameter_grid` at this stage is what you did in the previous stage. You can include/exclude all the parameters you mention there, and experiment and evaluate. Once you're certain about your choice of parameter grid, you move on to the validation stage.
The correlation coefficient isn't really a measure of predictive performance (except in the special case of linear regression). For example, predictions that are always ten times the true values are 100% correlated with the truth, yet badly off.
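A quick numerical illustration (my own numbers):

    import numpy as np

    y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y_pred = 10 * y_true                          # perfectly correlated with y_true, but far off

    print(np.corrcoef(y_true, y_pred)[0, 1])      # 1.0
    print(np.mean((y_true - y_pred) ** 2))        # mean squared error = 891.0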
That said, it's a measure that most people will recognise, and unless your models are going very wrong, a higher correlation generally means more accurate predictions.
Other measures you can use include:
Mean squared error:
$$\frac{1}{n} \sum(y - \hat{y})^2$$
Mean absolute error:
$$\frac{1}{n} \sum|y-\hat{y}|$$
Mean absolute relative error:
$$\frac{1}{n} \sum \frac{|y - \hat{y}|}{y}$$
The last one probably makes the most sense for strictly positive data, but be aware that if the denominator is small, you can get an exaggerated error measure.
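If it helps, here is a minimal sketch (plain NumPy, illustrative) of the three measures above:

    import numpy as np

    def mean_squared_error(y, y_hat):
        return np.mean((y - y_hat) ** 2)

    def mean_absolute_error(y, y_hat):
        return np.mean(np.abs(y - y_hat))

    # mean absolute relative error; y must be strictly positive,
    # and small denominators can inflate the result
    def mean_relative_error(y, y_hat):
        return np.mean(np.abs(y - y_hat) / y)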