Solved – Cross Validation and Nearest Neighbors

cross-validation, k-nearest-neighbour

What is the best way to approach multiple-fold cross validation for a 1 nearest-neighbor model used for prediction?

A common approach to cross validation is to, for example, split the dataset into 10 folds, train on 9 of them, and test on the remaining 1 (repeating for each of the 10 folds).

In a nearest-neighbors model the concepts of "training set" and "test set" do not apply in the same way that they do for a regression. An approximation of the above procedure would be to split the dataset into 10 folds, choose 1 fold as the "test set", and search for nearest neighbors in the remaining 9 (repeating for each fold). I could then calculate the MSE across the 10 folds by comparing predicted and actual responses. Is this a good approach to cross validation? Are there good alternative approaches in this case?
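For concreteness, here is a minimal sketch of that fold-based procedure, assuming scikit-learn is available and the data live in a NumPy feature matrix `X` with a (possibly multi-output) response matrix `y`; the function name and the use of `KNeighborsRegressor` are illustrative choices, not something specified in the question.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor

def tenfold_mse_1nn(X, y, n_splits=10, seed=0):
    """Estimate prediction MSE of a 1-NN model with k-fold cross validation."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_mse = []
    for train_idx, test_idx in kf.split(X):
        # "Training" a 1-NN model just means storing the 9 folds
        # that will be searched for nearest neighbours.
        model = KNeighborsRegressor(n_neighbors=1).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        # Compare predicted and actual responses on the held-out fold.
        fold_mse.append(np.mean((pred - y[test_idx]) ** 2))
    return np.mean(fold_mse)
```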

A specific alternative I am considering is doing splits that would leave more data in the test set than the training set. So, for example, I would choose 9 of the 10 folds as the "test set" and search for nearest neighbors in the remaining 1 (repeating for each fold). Is there an advantage/disadvantage to doing it this way?

Some specifics on my data: I have roughly 20K observations, roughly 30 features, and am predicting multiple responses (roughly 100). I am interested in using cross validation for model selection / evaluation. Because I am using a 1-nearest-neighbor model, I expect its variance to be high. Multiple-fold cross validation is therefore desirable to get a better estimate of the prediction MSE.

Best Answer

Usually, the larger the $K$ (the number of folds in the cross validation), the more accurate the estimate of the RMSE. See Choice of K in K-fold cross-validation, for example.

In your case, performing a leave-one-out cross-validation (LOOCV) is not much more expensive than a ten-fold! Indeed, with 10 folds, each prediction searches for the nearest neighbour among roughly $0.9n$ observations, so the overall number of distance computations is about $0.9n^2$. If you instead use the entire dataset (except the point you are trying to predict), you do about $n^2$ operations.
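To make that concrete with the numbers from the question ($n \approx 20\,000$, counting one distance computation per pair): ten-fold CV costs about $0.9 \times (2 \times 10^4)^2 = 3.6 \times 10^8$ distance computations, while LOOCV costs about $(2 \times 10^4)^2 = 4 \times 10^8$, only around 10% more.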

So, since the performance penalty for running LOOCV is low (which holds here precisely because a KNN model has no training step to repeat), I would probably use it.
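A minimal LOOCV sketch for the same 1-NN setup, again assuming scikit-learn; the helper name is illustrative, and with $n \approx 20$K the loop over every observation will take noticeably longer in wall-clock time than the ~10% operation-count difference suggests, because the neighbour index is rebuilt for each left-out point.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

def loocv_mse_1nn(X, y):
    """Estimate prediction MSE of a 1-NN model with leave-one-out CV."""
    model = KNeighborsRegressor(n_neighbors=1)
    # For each observation, the nearest neighbour is searched among all the others.
    pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
    return np.mean((pred - y) ** 2)
```

Since 1-NN has nothing to fit, one could equivalently build a single neighbour index on all $n$ points and, for each point, query the nearest neighbour other than the point itself; that avoids the $n$ refits and gives the same LOOCV predictions.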