Solved – K-nearest-neighbour with continuous and binary variables

classification, k-nearest-neighbour, r

I have a data set with columns a, b and c (3 attributes). a is numerical and continuous, while b and c are categorical, each with two levels. I am using the K-Nearest Neighbors method to classify a and b on c. So, to be able to measure distances, I transform my data set by removing b and adding b.level1 and b.level2. If observation i has the first level in the b categories, b.level1[i]=1 and b.level2[i]=0.

Now I can measure distances in my new data set: a b.level1 b.level2

From a theoretical/mathematical point of view: Can you perform K-nearest neighbor (KNN) with both binary and continuous data?

I am using the FNN package in R and its knn() function.

Best Answer

It is fine to combine categorical and continuous variables (features).

There is not much theoretical grounding for a method such as k-NN. The heuristic is that if two points are close to each other (according to some distance), then they have something in common in terms of output. That may or may not hold, and it depends on the distance you use.

In your example, you define a distance between two points $(a,b,c)$ and $(a',b',c')$ as follows:

  • Take the squared distance between $a$ and $a'$: $(a-a')^2$
  • Add +2 if $b$ and $b'$ are different, +0 if equal (because each of the two indicator columns contributes a difference of 1)
  • Add +2 if $c$ and $c'$ are different, +0 if equal (same reasoning)

This corresponds to giving weights implicitly to each feature.
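As a rough illustration of the distance described above, here is a minimal Python sketch. The helper names (`one_hot`, `encode`, `squared_distance`) and the level labels are made up for this example and are not part of FNN or any other package.

```python
def one_hot(level, levels):
    """Encode a categorical level as a list of 0/1 indicator values."""
    return [1.0 if level == l else 0.0 for l in levels]

def encode(a, b, c, levels=("level1", "level2")):
    """Turn (a, b, c) into [a, b.level1, b.level2, c.level1, c.level2]."""
    return [float(a)] + one_hot(b, levels) + one_hot(c, levels)

def squared_distance(p, q):
    """Squared Euclidean distance between two encoded points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

# Hypothetical observations: a differs by 1, b differs, c is the same.
p = encode(1.5, "level1", "level1")
q = encode(2.5, "level2", "level1")

# (1.5 - 2.5)^2 = 1 from a, +2 from the b mismatch, +0 from c
print(squared_distance(p, q))  # 3.0
```

The +2 for a mismatch comes from both indicator columns flipping at once, which is where the implicit weight on each categorical feature comes from.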

Note that if $a$ takes large values (like 1000, 2000...) with a large variance, then the weights of the binary features will be negligible compared to the weight of $a$: only the distance between $a$ and $a'$ will really matter. The reverse also holds: if $a$ takes small values like 0.001, only the binary features will count.
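To put rough numbers on this, the sketch below computes what fraction of the squared distance comes from the binary features, assuming the +2 per mismatch from the one-hot encoding described earlier. The function `binary_share` and the input values are hypothetical and only serve the illustration.

```python
def binary_share(a1, a2, mismatches):
    """Fraction of the squared distance contributed by mismatched binary features.

    Each mismatched two-level feature adds 2 (both indicator columns differ by 1).
    """
    continuous_part = (a1 - a2) ** 2
    binary_part = 2.0 * mismatches
    return binary_part / (continuous_part + binary_part)

# a on a large scale: the binary mismatch is almost invisible
large_scale = binary_share(1000.0, 2000.0, 1)

# a on a tiny scale: the binary mismatch is almost the whole distance
small_scale = binary_share(0.001, 0.002, 1)

print(large_scale, small_scale)
```

With these made-up values, the first share is on the order of one in a million, while the second is very close to 1.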

You can even out this behaviour by reweighting: divide each feature by its standard deviation. This applies to both continuous and binary variables. You may also supply your own preferred weights.
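Here is a small sketch of that reweighting in plain Python, assuming the sample standard deviation is used for every column. The data and the helper names (`standardize`, `nearest`) are invented for illustration, and this is not how any particular R package implements it.

```python
import statistics

def squared_distance(p, q):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def standardize(rows):
    """Divide every column by its sample standard deviation."""
    sds = [statistics.stdev(col) for col in zip(*rows)]
    # A constant column (sd == 0) adds nothing to any distance; leave it unscaled.
    return [[x / sd if sd > 0 else x for x, sd in zip(row, sds)] for row in rows]

def nearest(rows, i):
    """Index of the row closest to rows[i], excluding i itself."""
    others = [j for j in range(len(rows)) if j != i]
    return min(others, key=lambda j: squared_distance(rows[i], rows[j]))

# Made-up data; columns: a, b.level1, b.level2
rows = [
    [1000.0, 1.0, 0.0],
    [1010.0, 0.0, 1.0],
    [2000.0, 1.0, 0.0],
    [2010.0, 0.0, 1.0],
]

raw_neighbour = nearest(rows, 0)                  # driven almost entirely by a
scaled_neighbour = nearest(standardize(rows), 0)  # b now carries real weight
print(raw_neighbour, scaled_neighbour)
```

On the raw data, the nearest neighbour of the first row is the row with a similar value of a but a different b. After dividing by the standard deviations, it is the row with the same b. Neither answer is "correct" in itself; the point is that the scaling decides how much each feature counts.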

Note that the kNN() function in the R package DMwR does this for you: https://www.rdocumentation.org/packages/DMwR/versions/0.4.1/topics/kNN

As a first attempt, just use norm=TRUE (normalization). This avoids most of the nonsense that can appear when combining continuous and categorical features.