I have a data set with columns a, b, c (3 attributes). a is numerical and continuous, while b and c are categorical, each with two levels. I am using the K-Nearest Neighbors method to classify observations on c using a and b. So, to be able to measure distances, I transform my data set by removing b and adding b.level1 and b.level2. If observation i has the first level of b, then b.level1[i] = 1 and b.level2[i] = 0.
Now I can measure distances in my new data set with columns: a, b.level1, b.level2.
From a theoretical/mathematical point of view: Can you perform K-nearest neighbor (KNN) with both binary and continuous data?
I am using the FNN package in R, and the function knn().
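For concreteness, here is a minimal sketch of the encoding and classification described above. It is written in Python with made-up data values (in R, FNN::knn performs the equivalent classification step):

```python
import numpy as np

# Toy data (made-up values): a is continuous, b and c are binary categoricals.
a = np.array([1.2, 3.4, 2.2, 0.5])
b = np.array(["x", "y", "x", "y"])          # feature with two levels
c = np.array(["pos", "neg", "pos", "neg"])  # class label to predict

# One-hot encode b into b.level1 / b.level2, exactly as in the question.
b_level1 = (b == "x").astype(float)
b_level2 = (b == "y").astype(float)
X = np.column_stack([a, b_level1, b_level2])

# 1-NN classification of a new point (a = 2.0, b = "x") by Euclidean distance.
query = np.array([2.0, 1.0, 0.0])
dists = np.sqrt(((X - query) ** 2).sum(axis=1))
print(c[np.argmin(dists)])  # label of the nearest neighbor
```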
Best Answer
It's fine to combine categorical and continuous variables (features).
In a way, there is not much theoretical ground for a method such as k-NN. The heuristic is that if two points are close to each other (according to some distance), then they have something in common in terms of output. Maybe yes, maybe no. And it depends on the distance you use.
In your example, with c as the class and b dummy-encoded, you implicitly define a distance between two feature vectors $(a, b_1, b_2)$ and $(a', b_1', b_2')$ such as:
$$d\big((a, b_1, b_2), (a', b_1', b_2')\big) = \sqrt{(a - a')^2 + (b_1 - b_1')^2 + (b_2 - b_2')^2} = \sqrt{(a - a')^2 + 2 \cdot \mathbf{1}_{\{b \neq b'\}}}$$
This corresponds to giving weights implicitly to each feature.
Note that if $a$ takes large values (like 1000, 2000...) with large variance, then the weights of the binary features will be negligible compared to that of $a$: only the difference between $a$ and $a'$ will really matter. And the other way around: if $a$ takes small values like 0.001, only the binary features will count.
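A small numeric illustration of this scale effect (values are made up; the binary disagreement contributes the same amount, 2, in both cases):

```python
import numpy as np

# Large-scale a: the continuous gap swamps the binary features.
p_big = np.array([1000.0, 1.0, 0.0])
q_big = np.array([2000.0, 0.0, 1.0])
d_big = np.sqrt(((p_big - q_big) ** 2).sum())

# Small-scale a: the binary disagreement dominates instead.
p_small = np.array([0.001, 1.0, 0.0])
q_small = np.array([0.002, 0.0, 1.0])
d_small = np.sqrt(((p_small - q_small) ** 2).sum())

print(round(d_big, 3))    # ~1000.001: essentially just |a - a'|
print(round(d_small, 3))  # ~1.414: essentially just the binary part, sqrt(2)
```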
You may normalize the behaviour by reweighting: divide each feature by its standard deviation. This applies to both continuous and binary variables. You may also supply your own preferred weights.
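The standard-deviation reweighting can be sketched as follows (illustrative data, plain NumPy):

```python
import numpy as np

# Feature matrix [a, b.level1, b.level2] with a on a large scale (made up).
X = np.array([
    [1000.0, 1.0, 0.0],
    [2000.0, 0.0, 1.0],
    [1500.0, 1.0, 0.0],
    [1800.0, 0.0, 1.0],
])

# Divide each column by its standard deviation so no feature dominates.
# (Your own preferred weights could be applied here instead of 1/std.)
X_scaled = X / X.std(axis=0)

# After scaling, every column has unit standard deviation.
print(X_scaled.std(axis=0))  # [1. 1. 1.]
```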
Note that the R function kNN() in the DMwR package does this for you: https://www.rdocumentation.org/packages/DMwR/versions/0.4.1/topics/kNN
As a first attempt, just use norm = TRUE (normalization). This will avoid most of the nonsense that can appear when combining continuous and categorical features.