Optimal K Selection – KNN Model Best Practices

k-nearest-neighbour, model-selection

I performed 5-fold CV to select the optimal K for KNN, and it seems that the bigger K gets, the smaller the error…

[Figure: CV error plotted against K, one colored line per trial]

Sorry I didn't include a legend, but the different colors represent different trials. There are 5 in total, and there seems to be little variation between them. The error always seems to decrease as K gets larger. So how can I choose the best K? Would K = 3 be a good choice here, since the graph more or less levels off after K = 3?

Best Answer

If you carry on going, you will eventually see the CV error begin to go up again. This is because the larger you make $k$, the more smoothing takes place, and eventually you will smooth so much that the model under-fits the data rather than over-fitting it (make $k$ big enough and the output will be constant regardless of the attribute values). I'd extend the plot until the CV error starts to go noticeably up again, just to be sure, and then pick the $k$ that minimizes the CV error. The bigger you make $k$, the smoother the decision boundary and the simpler the model, so if computational expense is not an issue, I would go for a larger value of $k$ rather than a smaller one if the difference in their CV errors is negligible.
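As a rough illustration of that procedure, here is a minimal sketch in plain Python (no libraries, so it is slow but self-contained). Everything in it is an assumption made for the example rather than something taken from the question: the synthetic two-class data, the grid of $k$ values, the helper names `knn_predict`, `cv_error` and `choose_k`, and the tolerance `tol` that stands in for "negligible".

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x (squared Euclidean distance)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def cv_error(X, y, k, n_folds=5):
    """Mean misclassification rate of k-NN over n_folds interleaved folds."""
    n = len(X)
    fold_errors = []
    for f in range(n_folds):
        test_idx = list(range(f, n, n_folds))
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        wrong = sum(knn_predict(train_X, train_y, X[i], k) != y[i] for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / n_folds

def choose_k(errors, tol=0.01):
    """Largest k whose CV error is within tol of the minimum CV error."""
    best = min(errors.values())
    return max(k for k, e in errors.items() if e <= best + tol)

# Synthetic, deliberately unbalanced data (illustration only):
# 70 points of class 0 around (0, 0) and 50 points of class 1 around (2, 2).
rng = random.Random(0)
X = [(rng.gauss(c, 1.0), rng.gauss(c, 1.0)) for c, m in ((0, 70), (2, 50)) for _ in range(m)]
y = [label for label, m in ((0, 70), (1, 50)) for _ in range(m)]

# Sweep k all the way up to the training-fold size (96 of the 120 points).
# At k = 96 every query gets the majority class, i.e. a constant output.
ks = [1, 3, 5, 9, 15, 25, 45, 75, 96]
errors = {k: cv_error(X, y, k) for k in ks}

for k in ks:
    print(f"k={k:3d}  CV error={errors[k]:.3f}")
print("chosen k:", choose_k(errors))
```

The point of `choose_k` is the tie-breaking rule: among all values of $k$ whose CV error is practically indistinguishable from the minimum, it takes the largest one. How large `tol` should be is a judgment call; the spread between your five trials is one reasonable guide.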

If the CV error doesn't start to rise again, that probably means the attributes are not informative (at least for that distance metric) and giving constant outputs is the best that it can do.
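One way to check for that situation, sketched here under the assumption that this is a classification problem measured by misclassification rate, is to compare the best CV error against the error of a predictor that always outputs the most frequent class. The function names, the `margin` threshold and the two error curves below are made up for illustration.

```python
from collections import Counter

def constant_predictor_error(labels):
    """Error rate of always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 1 - most_common_count / len(labels)

def attributes_look_informative(cv_errors, labels, margin=0.02):
    """True if some k beats the constant predictor by more than `margin`."""
    return min(cv_errors.values()) < constant_predictor_error(labels) - margin

labels = [0] * 70 + [1] * 30                             # 70/30 class split (made up)
flat_curve = {1: 0.42, 5: 0.35, 25: 0.31, 70: 0.30}      # keeps falling toward the baseline
useful_curve = {1: 0.15, 5: 0.10, 25: 0.12, 70: 0.30}    # dips well below it, then rises

print(attributes_look_informative(flat_curve, labels))
print(attributes_look_informative(useful_curve, labels))
```

If the curve only ever approaches the constant-predictor error from above, as in `flat_curve`, then no value of $k$ is extracting anything useful from the attributes with that distance metric.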