Solved – Leave One Out Cross Validation

cross-validation, machine-learning, regression

I tried to implement the Leave One Out Cross Validation (LOOCV) method to find the best combination of 4 data points to train my model, which is of the form:

$$Y = a + b X_1 + c X_2,$$

where $a$, $b$, and $c$ are the regression coefficients. I have 20 data points in total to train my model, but I want to restrict the training set to only 4 data points while still being able to predict the remaining points fairly well.

So I used the LOOCV method to achieve that: I started with 19 data points as my training set and the remaining 1 data point as my test set, and computed the prediction error for each such split. I then eliminated the data point (removing it from the data set entirely) that gave the minimum error, since that point was best predicted by the model. I kept iterating this way until only 4 data points remained.
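For reference, here is a rough sketch of what I did, in Python with made-up data standing in for my actual 20 points (so the numbers are not mine, only the procedure). It also records the RMSE over the full data set after each elimination, which is what I plot below:

```python
import numpy as np

# Made-up data standing in for the real 20 points: columns of X are X1 and X2.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=20)

def fit_ols(X, y):
    """Fit Y = a + b*X1 + c*X2 by least squares; returns [a, b, c]."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

idx = list(range(len(y)))   # indices of the points still kept
rmse_history = []           # RMSE on all 20 points after each elimination

while len(idx) > 4:
    # Leave-one-out over the current set: prediction error for each held-out point.
    loo_errors = []
    for i in idx:
        train = [j for j in idx if j != i]
        coef = fit_ols(X[train], y[train])
        loo_errors.append((abs(predict(coef, X[[i]])[0] - y[i]), i))
    # Eliminate the point with the minimum error (it was predicted best).
    _, drop = min(loo_errors)
    idx.remove(drop)
    # Refit on the remaining points and score the model on the full data set.
    coef = fit_ols(X[idx], y[idx])
    rmse_history.append((len(idx), np.sqrt(np.mean((predict(coef, X) - y) ** 2))))

print(rmse_history)  # pairs of (number of points kept, RMSE on all 20 points)
```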

The problem is this: at each iteration I applied the fitted model to the whole set of data points (i.e., including the points that had already been left out) and computed the Root Mean Square Error (RMSE). I expected the plot of RMSE vs. the number of data points to have a roughly asymptotic shape, with higher RMSE at a lower number of data points, converging to the lowest RMSE values at a higher number of data points. But this is the plot I'm getting for my data set:

[Plot: RMSE vs. number of data points]

I don't know why the plot shows this kind of behavior. Also, is it valid to use the Leave One Out Cross Validation method the way I have used it? Please clarify.

Best Answer

After thinking a bit about your problem, it is essentially coreset selection: finding a small subset (the coreset) of the training data such that the model trained on the subset is as close as possible to the model trained on the full dataset.
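To make that objective concrete, here is a minimal sketch of one possible notion of "close" (the distance between the fitted coefficient vectors), on made-up data; the subset indices are just an arbitrary example, not a recommendation:

```python
import numpy as np

# Made-up data in the same shape as yours: an intercept column plus X1, X2.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=20)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_full = ols(X, y)                 # coefficients from all 20 points
subset = [0, 5, 11, 17]               # an arbitrary candidate coreset of size 4
beta_sub = ols(X[subset], y[subset])  # coefficients from the subset alone

# One possible measure of "as close as possible": the coefficient distance.
print(np.linalg.norm(beta_sub - beta_full))
```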

I'm not familiar with the area and it isn't a very easy term to google, but the paper Near-optimal Coresets For Least-Squares Regression (and the papers it references) is probably very relevant. It doesn't look like they directly solve the problem of finding the best $k$ points, though. Instead, their algorithm produces a coreset of size polynomially bounded in terms of a desired accuracy parameter and the rank of the data matrix. You might be able to massage their technique into what you want, though, by playing around with the bounds.

One thing I do want to mention: using some sort of CV procedure as your search technique, e.g. fitting a model for each of the $\binom{20}{4}$ splits of your data and selecting the 4 points that result in the model with the minimum error on the remaining 16 points, seems flawed. Notice that there is a sort of symmetry: you could also view this procedure as picking the 16 points that are most easily predicted from 4, which almost certainly isn't what you really want. Essentially, all such a procedure would be doing is searching for the split of your data that makes the problem easiest.
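For concreteness, such a search would look roughly like the sketch below (Python, with made-up data in place of yours); the closing comment is the symmetry I mean:

```python
import numpy as np
from itertools import combinations

# Made-up data in place of your 20 points: intercept column plus X1, X2.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=20)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

best_rmse, best_subset = np.inf, None
for subset in combinations(range(20), 4):          # all C(20, 4) = 4845 choices
    rest = [i for i in range(20) if i not in subset]
    beta = ols(X[list(subset)], y[list(subset)])   # fit on the 4 chosen points
    rmse = np.sqrt(np.mean((X[rest] @ beta - y[rest]) ** 2))
    if rmse < best_rmse:
        best_rmse, best_subset = rmse, subset

# The symmetry: minimizing error on the held-out 16 is the same as hunting for
# the 16 points that happen to be easiest to predict from some 4 points.
print(best_subset, best_rmse)
```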