Solved – Why do you need to scale data in KNN?

k nearest neighbour

Could someone please explain why you need to normalize data when using k-nearest neighbors?

I've tried to look this up, but I still can't seem to understand it.

I found the following link:

https://discuss.analyticsvidhya.com/t/why-it-is-necessary-to-normalize-in-knn/2715

But in this explanation, I don't understand why a larger range in one of the features affects the predictions.

Best Answer

The k-nearest neighbor algorithm classifies a test point by majority vote over the class labels of its k nearest training samples, where nearness is typically measured by Euclidean distance.
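To see numerically why the feature ranges matter, here is a small sketch (my own addition, with made-up feature values) showing how a feature with a much larger range dominates the Euclidean distance:

```python
import numpy as np

# Two features: x1 ranges over roughly [0, 1], x2 over tens of thousands.
a = np.array([0.20, 30000.0])  # query point
b = np.array([0.90, 30100.0])  # very different in x1, close in x2
c = np.array([0.25, 45000.0])  # almost identical in x1, far in x2

def euclid(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

print(euclid(a, b))  # ~100.0   -- the 0.7 gap in x1 is numerically invisible
print(euclid(a, c))  # ~15000.0 -- x2 alone decides the ranking
```

Point `b` comes out as the "nearer" neighbor even though it disagrees with `a` across almost the entire range of $x_1$: the distance is effectively computed on $x_2$ alone.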

Consider a simple two-class classification problem, where a Class 1 sample (black) is chosen along with its 10 nearest neighbors (filled green). In the first figure the data are not normalized; in the second they are.

[Figure 1: data without normalization. Figure 2: data with normalization.]

Notice how, without normalization, all the nearest neighbors are aligned along the axis with the smaller range, $x_1$. This happens because differences in the large-range feature, $x_2$, dominate the Euclidean distance: a point must be close in $x_2$ to qualify as a neighbor at all, while its $x_1$ value barely matters, and this leads to incorrect classification.

Normalization solves this problem!
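As a concrete check, here is a minimal scikit-learn sketch (again my own addition, on synthetic data, so the exact scores are illustrative) comparing the same KNN classifier with and without min-max scaling:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic two-class data: x1 carries the class signal with a small range,
# x2 is pure noise with a range ~1000x larger (assumed, illustrative values).
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
x1 = y + rng.normal(scale=0.3, size=n)   # informative, small range
x2 = rng.normal(scale=1000.0, size=n)    # no signal, huge range
X = np.column_stack([x1, x2])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Without scaling, the Euclidean distance is dominated by the noisy x2.
raw = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)

# Min-max scaling maps both features onto [0, 1] before the same classifier.
scaled = make_pipeline(MinMaxScaler(),
                       KNeighborsClassifier(n_neighbors=10)).fit(X_train, y_train)

print("accuracy without scaling:", raw.score(X_test, y_test))
print("accuracy with scaling:   ", scaled.score(X_test, y_test))
```

On this construction the unscaled model should hover near chance, since its neighbors are chosen almost entirely by the noise feature, while the scaled pipeline can recover the signal in $x_1$.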