Solved – Which type of data normalization should be used with KNN

k nearest neighbour, machine learning, normalization, standardization

I know that there are more than two types of normalization.

For example,

1- Transforming data using a z-score or t-score. This is usually called standardization.

2- Rescaling data to have values between 0 and 1.
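
For concreteness, here is a minimal NumPy sketch of these two transforms (the matrix `X` is a made-up toy example, one column per feature):

```python
import numpy as np

# Toy feature matrix: rows are examples, columns are features (made-up values)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# 1- Standardization (z-score): zero mean and unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2- Min-max rescaling: maps each feature into [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std)
print(X_minmax)
```

(scikit-learn's `StandardScaler` and `MinMaxScaler` implement these same two transforms.)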

The question now, assuming I do need normalization, is:

Which type of data normalization should be used with KNN, and why?

Best Answer

For k-NN, I'd suggest normalizing the data between $0$ and $1$.

k-NN uses the Euclidean distance as its means of comparing examples. To calculate the distance between two points $x_1 = (f_1^1, f_1^2, ..., f_1^M)$ and $x_2 = (f_2^1, f_2^2, ..., f_2^M)$, where $f_1^i$ is the value of the $i$-th feature of $x_1$:

$$ d(x_1, x_2) = \sqrt{(f_1^1 - f_2^1)^2 + (f_1^2 - f_2^2)^2 + ... + (f_1^M - f_2^M)^2} $$
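
A direct translation of this formula into code, as a small sketch (NumPy assumed; not part of any particular k-NN library):

```python
import numpy as np

def euclidean_distance(x1, x2):
    # d(x1, x2) = sqrt(sum over i of (f1^i - f2^i)^2)
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return np.sqrt(np.sum((x1 - x2) ** 2))
```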

In order for all of the features to be of equal importance when calculating the distance, the features must have the same range of values. This is only achievable through normalization.

If they were not normalized, and, for instance, feature $f^1$ had a range of values in $[0, 1)$ while $f^2$ had a range of values in $[1, 10)$, then when calculating the distance the second term would be about $10$ times more important than the first, leading k-NN to rely more on the second feature than the first. Normalization ensures that all features are mapped to the same range of values.
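
To make this concrete, here is a small sketch with two made-up points whose features have exactly those ranges:

```python
import numpy as np

# Two toy points: feature 1 lies in [0, 1), feature 2 lies in [1, 10)
a = np.array([0.1, 2.0])
b = np.array([0.9, 9.0])

# Unnormalized: the per-feature differences are [0.8, 7.0], so the
# second feature dominates the Euclidean distance (~7.05) almost entirely
print(np.sqrt(np.sum((a - b) ** 2)))

# Min-max normalization using the (assumed known) feature ranges
lo, hi = np.array([0.0, 1.0]), np.array([1.0, 10.0])
a_n, b_n = (a - lo) / (hi - lo), (b - lo) / (hi - lo)

# Normalized differences are [0.8, ~0.78]: both features now
# contribute on the same scale
print(np.abs(a_n - b_n))
```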

Standardization, on the other hand, does have many useful properties, but it can't ensure that the features are mapped to the same range. While standardization may be better suited for other classifiers, it is not the best choice for k-NN or any other distance-based classifier.
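
As a quick illustration of why standardization gives no fixed range, here is a sketch with a made-up skewed feature:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)  # skewed toy feature

z = (x - x.mean()) / x.std()               # standardized: mean 0, std 1
# The resulting range is data-dependent: roughly -1 on the left, while
# the right tail can extend to 5 or more; features are not mapped to [0, 1]
print(z.min(), z.max())
```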