Leave-One-Out Cross-Validation – How It Works and Selecting the Final Model


I have some data and I want to build a model (say, a linear regression model) from this data. As a next step, I want to apply Leave-One-Out Cross-Validation (LOOCV) to the model to see how well it performs.

If I understood LOOCV correctly, I build a new model for each of my samples (the test set), using every sample except that one (the training set). Then I use the model to predict the held-out sample and calculate the error $(\text{predicted} - \text{actual})$.

Next, I aggregate all the errors using a chosen function, for example the mean squared error. I can use this value to judge the quality (or goodness of fit) of the model.
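A minimal sketch of this procedure, assuming a simple linear regression and scikit-learn's LinearRegression (the data and variable names here are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                      # 20 samples, 2 predictors
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=20)

squared_errors = []
for i in range(len(X)):
    train = np.delete(np.arange(len(X)), i)       # every sample except sample i
    model = LinearRegression().fit(X[train], y[train])
    pred = model.predict(X[i:i + 1])[0]           # predict the held-out sample
    squared_errors.append((pred - y[i]) ** 2)     # (predicted - actual)^2

loocv_mse = np.mean(squared_errors)               # aggregate with mean squared error
print(f"LOOCV mean squared error: {loocv_mse:.4f}")
```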

Question: Which model do these quality values apply to? In other words, which model should I choose if I find the metrics generated by LOOCV acceptable for my case? LOOCV looked at $n$ different models (where $n$ is the sample size); which one is the model I should choose?

  • Is it the model that uses all the samples? That model was never fitted during the LOOCV process!
  • Is it the model with the smallest error?

Best Answer

It is best to think of cross-validation as a way of estimating the generalisation performance of models generated by a particular procedure, rather than of the model itself. Leave-one-out cross-validation is essentially an estimate of the generalisation performance of a model trained on $n-1$ samples of data, which is generally a slightly pessimistic estimate of the performance of a model trained on $n$ samples.

Rather than choosing one model, fit the model to all of the data and use LOOCV to provide a slightly conservative estimate of the performance of that model.
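A sketch of this workflow, assuming scikit-learn and the X, y arrays from the example above: LOOCV is used only to estimate performance, and the model actually kept is trained on the full dataset.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# LOOCV estimate of the generalisation error (scikit-learn reports
# negative MSE so that higher scores are better).
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
loocv_mse = -scores.mean()
print(f"Estimated generalisation MSE (LOOCV): {loocv_mse:.4f}")

# The model used going forward is fitted on all of the data.
final_model = LinearRegression().fit(X, y)
```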

Note however that LOOCV has a high variance (the value you will get varies a lot if you use a different random sample of data) which often makes it a bad choice of estimator for performance evaluation, even though it is approximately unbiased. I use it all the time for model selection, but really only because it is cheap (almost free for the kernel models I am working on).
