As this question and its answer pointed out, k-fold cross-validation (CV) is used for model selection, e.g. choosing between linear regression and a neural network. It is also suggested that, after deciding which kind of model to use, the final predictor should be trained on the entire data set. My question is: how can we evaluate this final predictor? Is it sufficient to just use the average of the k accuracies obtained during k-fold CV?
Solved – How to evaluate the final model after k-fold cross-validation
cross-validation
Related Solutions
"To produce the estimates on the test set do I simply average the weights and biases from each of the 10 different calibrated models and use this parametrization to produce outputs to compare with my test set for the target function?"
No. Cross-validation is a procedure for estimating the test performance of a method of producing a model, rather than of the model itself. So the best thing to do is to perform k-fold cross-validation to determine the best hyper-parameter settings, e.g. number of hidden units, values of regularisation parameters etc. Then train a single network on the whole calibration set (or several and pick the one with the best value of the regularised training criterion to guard against local minima). Evaluate the performance of that model using the test set.
In the case of neural networks, averaging the weights and biases of individual models won't work, as different models will settle on different internal representations, so the corresponding hidden units of different networks will represent different (distributed) concepts. If you average their weights, the mean of these concepts will be meaningless.
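Here is a minimal sketch of that workflow, assuming tabular data in NumPy arrays and using scikit-learn's MLPRegressor and GridSearchCV as stand-ins for the network and the hyper-parameter search (the data, parameter grid, and split sizes are purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Illustrative data; replace with your own X, y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Hold out a test set that plays no part in model selection.
X_cal, X_test, y_cal, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# k-fold CV on the calibration set to choose hyper-parameters
# (number of hidden units, regularisation strength alpha).
param_grid = {"hidden_layer_sizes": [(10,), (50,)], "alpha": [1e-4, 1e-2]}
search = GridSearchCV(
    MLPRegressor(max_iter=2000, random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
# refit=True (the default) retrains a single model with the best
# settings on the whole calibration set.
search.fit(X_cal, y_cal)

# Evaluate that one refitted model on the untouched test set.
test_rmse = np.sqrt(mean_squared_error(y_test, search.predict(X_test)))
print(search.best_params_, test_rmse)
```

Note that GridSearchCV's default refit behaviour already performs the "train a single model on the whole calibration set" step; only the final line touches the test set.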
You can combine a rolling forward origin with k-fold cross-validation (aka backtesting with cross-validation). Determine the folds up-front once, and at each rolling time iterate through the k folds, training on k-1 folds and testing on the remaining one. The union of all the held-out test folds gives you one complete coverage of the dataset available at that time, and the train folds cover that dataset k-1 times, which you can aggregate in whatever way is appropriate (e.g., the mean). Then score train and test separately, as you ordinarily would, to get separate train/test scores at that time.
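As a rough illustration, here is one way this could look in code. It assumes a single time-indexed dataset held in NumPy arrays, a fold assignment drawn once up-front, a placeholder Ridge model, and a hand-picked list of rolling origins; all of these are assumptions for the sketch, not part of the procedure above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Illustrative time-indexed data; replace with your own.
rng = np.random.default_rng(0)
n = 600
t = np.arange(n)                      # time index, one observation per step
X = rng.normal(size=(n, 5))
y = X @ rng.normal(size=5) + 0.01 * t + rng.normal(scale=0.1, size=n)

k = 5
fold_id = rng.integers(0, k, size=n)  # folds determined up-front, once
origins = [300, 400, 500]             # rolling times (placeholders)

for origin in origins:
    window = t <= origin              # data available at this rolling time
    train_scores, test_scores = [], []
    for fold in range(k):
        train_mask = window & (fold_id != fold)
        test_mask = window & (fold_id == fold)
        model = Ridge().fit(X[train_mask], y[train_mask])
        train_scores.append(rmse(y[train_mask], model.predict(X[train_mask])))
        test_scores.append(rmse(y[test_mask], model.predict(X[test_mask])))
    # Held-out folds jointly cover the whole window once;
    # train folds cover it k-1 times. Aggregate (here: mean).
    print(origin, np.mean(train_scores), np.mean(test_scores))
```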
When optimizing parameters, create a separate holdout set first, and then do the cross-validation just described on only the remaining data. For each parameter to be optimized, you need to decide whether that parameter is independent of time (so you can perform the optimization over all rolling times) or dependent on time (so the parameter is optimized separately at each time). If the latter, you might represent the parameter as a function of time (possibly linear) and then optimize the time-independent coefficients of that function over all times.
Best Answer
When you train on each fold (90% of the data), you then predict on the remaining 10%. With this 10% you compute an error metric (RMSE, for example). This leaves you with 10 values of RMSE and 10 sets of corresponding predictions. There are two things to do with these results (a sketch of both steps follows the list):
Inspect the mean and standard deviation of your 10 RMSE values. k-fold CV takes random partitions of your data, so the error should not vary too greatly from fold to fold. If it does, your model (and its features, hyper-parameters, etc.) cannot be expected to yield stable predictions on a test set.
Aggregate your 10 sets of predictions into 1 set of predictions. For example, if your training set contains 1,000 data points, you will have 10 sets of 100 predictions (10*100 = 1000). When you stack these into one vector, you are left with 1,000 predictions: one for every observation in your original training set. These are called out-of-fold predictions. With these, you can compute the RMSE for your whole training set in one go, as rmse = compute_rmse(oof_predictions, y_train). This is likely the cleanest way to evaluate the final predictor.
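A minimal sketch of both steps, assuming X_train and y_train are NumPy arrays and using Ridge as a placeholder model; compute_rmse and oof_predictions simply mirror the names used above:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def compute_rmse(y_pred, y_true):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Illustrative training data; replace with your own X_train, y_train.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 8))
y_train = X_train @ rng.normal(size=8) + rng.normal(scale=0.1, size=1000)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_rmses = []
oof_predictions = np.empty_like(y_train)

for train_idx, val_idx in kf.split(X_train):
    model = Ridge().fit(X_train[train_idx], y_train[train_idx])
    preds = model.predict(X_train[val_idx])
    fold_rmses.append(compute_rmse(preds, y_train[val_idx]))  # step 1: per-fold error
    oof_predictions[val_idx] = preds                           # step 2: out-of-fold predictions

# Step 1: stability of the error across folds.
print("per-fold RMSE:", np.mean(fold_rmses), "+/-", np.std(fold_rmses))

# Step 2: one RMSE over the whole training set in one go.
print("OOF RMSE:", compute_rmse(oof_predictions, y_train))
```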