Classification – Deciding Which K to Use for KNN Model Based on Plot Analysis

classification, k nearest neighbour

I've been working with a dataset of handwritten numbers, and I've used KNN to classify which number each one is. I've made a plot comparing the validation and training misclassification rates for each K in [1, 30]; see below.

[Figure: K-Plot, validation and training misclassification rate against K]

My question is the following:
The optimal model should be picked based on the lowest validation error, but in my plot there are two values of K which yield the same misclassification error. Is there a correct approach to this, or is it up to me to decide?

The options I can think of are:

A) Choose K = 4, as a higher K yields a less complex model.

B) Choose K = 3, as this is the K at the turning point where the model stops improving when tested on new data.

Best Answer

If you decide based only on the validation error, the two models are nearly the same, and $k = 3$ and $k = 4$ are not very different models anyway. You don't have a good reason to prefer either one, so you can choose whichever you like. What you could do is exploratory data analysis of their predictions and misclassified cases, to see how they differ. If they don't differ, the choice is arbitrary. I'd personally go with the higher $k$, as in theory it is less likely to overfit because it averages over more observations.
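
As a rough illustration of that kind of comparison, here is a minimal, dependency-free sketch. Everything in it is an assumption for the sake of the example: the toy three-cluster data from `make_data` stands in for the handwritten numbers, and `knn_predict` is a hand-rolled KNN rather than whatever library the question used. It computes the validation error for each K from 1 to 30, lists the values of K tied for the lowest error, and then counts the validation cases where the K = 3 and K = 4 models disagree.

```python
import random
from collections import Counter


def knn_predict(train, k, x):
    """Majority vote among the k training points nearest to x."""
    # Squared Euclidean distance is enough for ranking neighbours.
    nearest = sorted(
        train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
    )[:k]
    # Counter keeps insertion order, so a tied vote (possible for even k)
    # goes to the label that appears first among the nearest neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def make_data(n, rng):
    """Hypothetical toy data: three noisy 2-D clusters, one per class."""
    centres = {0: (0.0, 0.0), 1: (2.0, 0.0), 2: (1.0, 2.0)}
    data = []
    for _ in range(n):
        label = rng.choice(list(centres))
        cx, cy = centres[label]
        data.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    return data


rng = random.Random(0)
train = make_data(300, rng)
valid = make_data(150, rng)

# Validation predictions and misclassification rate for each K in 1..30.
preds = {k: [knn_predict(train, k, x) for x, _ in valid] for k in range(1, 31)}
errors = {
    k: sum(p != y for p, (_, y) in zip(preds[k], valid)) / len(valid)
    for k in preds
}

best = min(errors.values())
tied = [k for k, e in errors.items() if e == best]
print("lowest validation error:", best, "at K =", tied)

# Case-by-case comparison of two candidates (K = 3 and K = 4, as in the question).
a, b = 3, 4
disagree = [i for i in range(len(valid)) if preds[a][i] != preds[b][i]]
only_a_wrong = [i for i in disagree if preds[b][i] == valid[i][1]]
only_b_wrong = [i for i in disagree if preds[a][i] == valid[i][1]]
print("validation cases where the two models disagree:", len(disagree))
print("K=%d wrong, K=%d right: %d" % (a, b, len(only_a_wrong)))
print("K=%d wrong, K=%d right: %d" % (b, a, len(only_b_wrong)))
```

If the list of disagreements is empty or tiny, that supports treating the choice as arbitrary; if it is not, looking at which cases each model gets wrong may show whether one of them fails on examples you care about more.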
