Does cross-validation apply to K-Nearest Neighbors given no estimated parameters?

cross-validation, k-nearest-neighbour, machine-learning

Cross-validation involves:

  1. taking your original set X,
  2. removing some data (e.g. one observation in LOO) to produce a residual "training" set Z and a "holdout" set W,
  3. fitting your model on Z,
  4. using the estimated parameters to predict the outcome for W,
  5. calculating some predictive performance measure (e.g. correct classification),
  6. repeating for different Z and W (e.g. new training/holdout sets), and
  7. calculating average predictive performance for your model.

Repeating this for multiple modeling decisions, you would then select the model with the best performance.
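For concreteness, here is roughly what steps (1)-(7) look like for a model with estimated parameters (a minimal sketch assuming scikit-learn, with logistic regression standing in for any parametric model; the specific names and data are illustrative only):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    X, y = make_classification(n_samples=100, random_state=0)   # (1) original set X

    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        Z_X, Z_y = X[train_idx], y[train_idx]        # (2) training set Z ...
        W_X, W_y = X[test_idx], y[test_idx]          #     ... and holdout set W
        model = LogisticRegression().fit(Z_X, Z_y)   # (3) fit the model on Z
        preds = model.predict(W_X)                   # (4) predict W from the estimated parameters
        scores.append(np.mean(preds == W_y))         # (5) correct-classification rate
        # (6) the loop repeats for new training/holdout splits
    print(np.mean(scores))                           # (7) average predictive performance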

I am confused about how this works for K-nearest neighbors (KNN). In the KNN context there are no estimated parameters, only one hyperparameter: $K$. Our objective is to select the "best" $K$, but there are no learned parameters we can use to produce a prediction for our holdout set: we can't, say, multiply the $\beta$ for a feature by the feature value and sum across all such products to obtain a prediction. Furthermore, it is possible to choose a holdout set so large that the remaining training set has fewer than $K$ cases, in which case the KNN estimator is undefined.

Is cross-validation then not applicable to KNN? I can see two ways something like cross-validation could be used for KNN, but both seem to violate the principle of not validating with your training data (and even the notions of "training" and "validation" become ambiguous here):

  1. Partition data into smaller data sets, employ KNN on each set, calculate performance measure, then choose model based on the distribution of performance across partitions (e.g. best mean performance; see the sketch after this list). This is not really cross-validation because there are no training/testing/validation sets.

  2. Generate bootstrapped data sets and repeat the steps in (1).
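To be explicit, here is roughly what I have in mind for (1) (a sketch assuming scikit-learn; note that each model is scored on the same data it was fit on, which is exactly my worry):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def performance_across_partitions(X, y, k=5, n_partitions=5, seed=0):
        rng = np.random.default_rng(seed)
        partitions = np.array_split(rng.permutation(len(X)), n_partitions)
        scores = []
        for idx in partitions:
            knn = KNeighborsClassifier(n_neighbors=k).fit(X[idx], y[idx])
            scores.append(knn.score(X[idx], y[idx]))  # validated on its own training data
        return scores  # distribution of performance across partitions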

Note: Posts on this topic (see How does k-fold cross validation fit in the context of training/validation/testing sets? and Cross Validation and Nearest Neighbors) have not specifically asked what training/holdout sets mean for KNN, and the responses there discuss how cross-validation works in general without addressing the lack of estimated parameters in KNN.

Best Answer

Cross-validation can be applied as long as the model is predictive (i.e. $\mathbf x \mapsto y$), regardless of how that model works internally. This general applicability is one of the strong advantages of cross-validation. It also implies that there is nothing special about cross-validating a $k$-nearest-neighbour model compared to, say, a logistic regression model.

$k$NN produces predictions by looking at the $k$ nearest neighbours of a case $\mathbf x$ to predict its $y$, so that's fine. In particular, the $k$NN model basically consists of its training cases, but that is something the cross-validation procedure doesn't care about at all.

We may describe cross-validation as:


loop over splits $i$ {

  1. split data into $\mathbf Z_i$ (train) and $\mathbf W_i$ (test)
  2. $model$ := training_function ($\mathbf Z_i$, further parameters)
  3. $predictions_i$ := prediction_function ($model$, $\mathbf W_i$)

}

calculate figure of merit from all $predictions$ and reference values


training_function and prediction_function are just the same training and prediction functions that are used to produce a production model.
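In Python, that skeleton might look like the following (a sketch only; the names mirror the pseudocode above, and any model's training and prediction functions can be plugged in):

    import numpy as np

    def cross_validate(X, y, training_function, prediction_function, n_splits=5, seed=0):
        # loop over splits i: the procedure knows nothing about the model's internals
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), n_splits)
        hits = []
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(len(X)), test_idx)  # 1. split into Z_i and W_i
            model = training_function(X[train_idx], y[train_idx])  # 2. model := training_function(Z_i)
            predictions = prediction_function(model, X[test_idx])  # 3. predictions_i := prediction_function(model, W_i)
            hits.extend(predictions == y[test_idx])
        return np.mean(hits)  # figure of merit: fraction of correct classifications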

Now, the $k$NN training_function would just store the training cases and the chosen $k$ in $model$, and the $k$NN prediction_function would then look up the $k$ (number taken from $model$) closest training cases (also taken from $model$) to the submitted case(s) and calculate the prediction from their $y$ values.
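A minimal brute-force pair of such functions for $k$NN classification could look like this (an illustrative sketch, not a reference implementation):

    import numpy as np
    from collections import Counter

    def knn_training_function(X_train, y_train, k=3):
        # "training" just stores the training cases and the chosen k in the model
        return {"X": np.asarray(X_train), "y": np.asarray(y_train), "k": k}

    def knn_prediction_function(model, X_new):
        X, y, k = model["X"], model["y"], model["k"]
        predictions = []
        for x in np.atleast_2d(X_new):
            dists = np.linalg.norm(X - x, axis=1)  # distance to every stored training case
            nearest = np.argsort(dists)[:k]        # indices of the k closest training cases
            predictions.append(Counter(y[nearest]).most_common(1)[0][0])  # majority vote over their y
        return np.array(predictions)

These plug straight into the cross_validate skeleton above, e.g. cross_validate(X, y, lambda Xz, yz: knn_training_function(Xz, yz, k=5), knn_prediction_function) to estimate the performance of $k = 5$.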

So the only thing that is a bit peculiar about $k$NN compared to many other models is that the training function may not be doing much "proper work". But that is really true only for the brute-force approach: there are ways to put the training data into structures (e.g. KD-trees or ball trees) that allow faster neighbour lookup at prediction time, and building those does "proper work" during training.
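For instance, scikit-learn's KNeighborsClassifier can build such a structure at fit time:

    from sklearn.neighbors import KNeighborsClassifier

    # "brute" essentially just stores the training cases; "kd_tree" (or
    # "ball_tree") builds a search tree in fit(), i.e. proper work at training time
    knn_brute = KNeighborsClassifier(n_neighbors=5, algorithm="brute")
    knn_tree = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")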