Solved – Machine learning – Calculating predictive accuracy: cross-validation vs accuracy on unseen data

accuracy, cross-validation, machine-learning, predictive-models

Which is the better/more reliable measure of a model's predictive accuracy: the accuracy based on n-fold cross-validation, or the model's accuracy on an unseen dataset?

At first glance, I would have thought that predicting on the unseen dataset would be the better measure of predictive accuracy, but on second thought it seems likely that this set will be smaller than the training dataset and hence more prone to anomalies that misrepresent the model's predictive power.

Any references in the answer would be greatly appreciated.

EDIT: Describing the current setup –
With a dataset of 100k rows, I randomly selected 10% and set these aside, building the model on the remaining data. I then used the 10% to generate predictions and calculate the accuracy, which of course differs from the accuracy of the predictions from cross-validation. Which of the two more realistically represents the model's predictive accuracy?
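
For concreteness, here is a minimal sketch of that setup in scikit-learn, using a synthetic stand-in for the 100k-row dataset and a plain logistic regression (the actual data, features, and model are unknown, so treat this purely as an illustration of the two estimates being compared):

```python
# Minimal sketch of the described setup (synthetic stand-in data, hypothetical model):
# compare the accuracy on a random 10 % hold-out set with a 10-fold
# cross-validation estimate computed on the full data set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the 100k-row dataset described in the question.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# Random 10 % hold-out; model built on the remaining 90 %.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
holdout_acc = accuracy_score(y_test, model.predict(X_test))

# 10-fold cross-validation on the full dataset for comparison.
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=10, scoring="accuracy")

print(f"hold-out accuracy (10 %): {holdout_acc:.4f}")
print(f"10-fold CV accuracy:      {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
```

With samples of this size the two estimates usually agree closely, which is also what the answer below argues.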

Thanks

Best Answer

Predictive accuracy always needs to be calculated on unseen data, whether that data is unseen via cross-validation splits or via a separate data set.

Often the most important point is to avoid leaks between training and test data. This may be easier to achieve with a hold-out set (e.g. by obtaining the test cases only after model training is finished) than with resampling.
But be careful: very often the terms "hold out" or "independent test" are used for what is in fact a single random split of the available data set. Such a procedure is of course prone to the same data leaks as cross-validation.

Yes, for simple data, cross-validation makes more efficient use of your data, and in small-sample-size situations that can be the crucial advantage of resampling. But when you have to deal with multiple confounders and need to split independently on all of them, that advantage vanishes very fast, because you end up excluding large parts of your data from both the test and the training set of each surrogate model.

UPDATE: the described scenario is 100k rows (I assume these are cases) × an unknown number of variates.

That is certainly not a small-sample-size situation. Here, a random hold-out set of 10% = 10,000 cases should show no practically relevant difference from cross-validation results. All the more so because a random subset is prone to the same data leaks that cross-validation is prone to: confounders that lead to clustering in the data. If you have such confounders, your effective sample size may be orders of magnitude below the 100k rows, and any kind of splitting that does not take those confounders into account means a data leak between training and test data, and leads to an overoptimistic bias in the error estimates.
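
As a hedged illustration of splitting that does take such a confounder into account, here is a sketch using scikit-learn's GroupKFold. The grouping variable, called group_id here, is hypothetical; in practice it would be whatever identifies the clusters (e.g. patient, batch, or measurement site). The answer does not prescribe a specific tool; group-wise splitting is simply one standard way to keep whole clusters on one side of each split:

```python
# Hedged sketch of confounder-aware (group-wise) splitting with GroupKFold.
# "group_id" is a hypothetical identifier of the clustering confounder
# (e.g. patient, batch, or site); all rows sharing a group end up on the
# same side of every split, so the confounder cannot leak into the test folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic data; the group labels below are artificial and only illustrate the mechanics.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
group_id = np.repeat(np.arange(500), 20)   # 500 clusters of 20 rows each

scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X, y,
    groups=group_id,
    cv=GroupKFold(n_splits=10),
    scoring="accuracy")
print(f"group-wise 10-fold CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```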

The more efficient use of cases in cross-validation is mostly relevant for small data sets, where

  1. stability of the model is an issue and must be checked (which is easily done by cross-validation), and
  2. uncertainty of the test result is large due to the small number of test cases;
     here cross-validation is better, as a full run tests each case (see the sketch below).
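
As a small illustration of point 2, the following sketch (again with synthetic data and an arbitrary classifier, so purely illustrative) uses cross-validation to obtain one out-of-fold prediction per case, so all cases contribute to the accuracy estimate:

```python
# Small-sample sketch (synthetic data): a full cross-validation run yields one
# out-of-fold prediction per case, so every one of the 60 cases contributes to
# the accuracy estimate, unlike a single small hold-out set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=60, n_features=10, random_state=0)
oof_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=10)
print(f"out-of-fold accuracy over all {len(y)} cases: {accuracy_score(y, oof_pred):.3f}")
```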

For the theory, I recommend reading the relevant parts of The Elements of Statistical Learning.

These papers report empirical results on the bias and variance of different validation schemes (though they deal explicitly with small-sample-size situations):