Solved – regression with kNN on a dataset with categorical variables

k nearest neighbour

I am trying to train a regression model on a dataset with 500k observations and 3 features. The features are categorical and have 50, 50 and 100 levels.

Is kNN generally appropriate for this kind of task?

I am using R. I tried to turn my categorical variables into dummy variables, but I end up with a very large and sparse data set. I am using data.matrix for the conversion, and it stores the matrix as double by default.

Is there a way to set it to logical instead?

Best Answer

I expect you are talking about nominal categorical variables? Ordinal variables with 100 levels would be very unusual: I have never seen a Likert scale with 100 nuances, or anything else that would warrant a 100-level ordinal variable. If you do have ordinal variables with that many levels, investigate whether you can reasonably treat them as interval variables. That is defensible when it is reasonable to assume that the distance between any two adjacent levels is the same across the whole scale.
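As a minimal sketch of that transformation in base R: an ordered factor can be mapped to its integer level codes, which treats adjacent levels as equally spaced. The variable `rating` and its five levels are made up for illustration; they are not from the question.

```r
# Hypothetical ordinal variable; the level order is declared explicitly.
rating <- factor(
  c("low", "high", "medium", "very high", "very low"),
  levels = c("very low", "low", "medium", "high", "very high"),
  ordered = TRUE
)

# Integer codes follow the declared level order (1 = "very low", ..., 5 = "very high").
# Using them as a numeric feature assumes equal spacing between adjacent levels.
rating_num <- as.integer(rating)
print(rating_num)
```

This is only appropriate for ordinal data. For nominal factors the integer codes carry no meaning and should not be used as a numeric feature.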

If I had only nominal categorical data, I would first look at tree-based models, since handling such data is where they naturally shine. With so many levels in so few categorical variables, I would expect random forests to do better than single pruned trees. You can test both, though.
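A rough sketch of how such a comparison could look in R is given here. It runs on simulated data, not the asker's, and the package choices are my assumptions: `rpart` for the single pruned tree and `ranger` for the random forest. The response `y` and the level effects are invented purely so the code runs.

```r
library(rpart)   # single regression tree (recommended package, ships with R)
library(ranger)  # random forest; assumed to be installed

set.seed(1)
n <- 5000  # small stand-in for the 500k observations

# Three nominal factors with 50, 50 and 100 levels, as in the question.
make_factor <- function(k) {
  factor(sample(seq_len(k), n, replace = TRUE), levels = seq_len(k))
}
d <- data.frame(f1 = make_factor(50), f2 = make_factor(50), f3 = make_factor(100))

# Hypothetical response: additive level effects plus noise.
eff1 <- rnorm(50); eff2 <- rnorm(50); eff3 <- rnorm(100)
d$y <- eff1[as.integer(d$f1)] + eff2[as.integer(d$f2)] +
  eff3[as.integer(d$f3)] + rnorm(n, sd = 0.5)

train <- sample(n, 4000)

# Single tree, pruned at the complexity parameter with the lowest
# cross-validated error reported in the cp table.
tree <- rpart(y ~ ., data = d[train, ], method = "anova")
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best_cp)

# Random forest; "order" sorts factor levels by mean response before splitting,
# which keeps factors with many levels tractable.
rf <- ranger(y ~ ., data = d[train, ], num.trees = 200,
             respect.unordered.factors = "order")

# Compare hold-out error of the two models.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_tree <- rmse(d$y[-train], predict(pruned, d[-train, ]))
rmse_rf <- rmse(d$y[-train], predict(rf, data = d[-train, ])$predictions)
print(c(tree = rmse_tree, forest = rmse_rf))
```

Which model wins depends on the real data, so the hold-out errors printed here say nothing about the original problem; the point is only the shape of the comparison.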