Solved – How to calculate the distance in KNN for mixed data types

distance, k-nearest-neighbour, mixed-type-data, python, scikit-learn

When the data contains different types (numerical and categorical),
Euclidean distance alone or Hamming distance alone of course can't help.
So I have 2 approaches:

  1. Rescale all the data with min-max scaling so that every numeric feature lies in [0,1], then use Euclidean distance alone.

  2. Calculate the Euclidean distance for the numeric data and the Hamming distance for the categorical data, then combine both distances (with weights).

My questions are:
Are my 2 approaches correct? If yes, which is better? How can I combine the distances (choosing the weight for each feature)? Is there an implementation of the second approach in sklearn in Python?

Best Answer

In my opinion your first approach isn't enough because of the difference between categorical and numerical data. Standardisation might be more appropriate, but I don't have enough knowledge about it, and I recommend treating the two types separately.

Your second proposal seems great because you use an appropriate distance for each type of data and combine them to obtain a final result. There is a lot to discuss about how to weight them.
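
As a rough illustration of the second approach, here is a minimal Python sketch. It assumes the numeric columns are already min-max scaled to [0,1] and the categorical columns are integer-encoded. The column indices, the equal weights and the toy data are my own assumptions, not something taken from the question. scikit-learn's KNeighborsClassifier accepts a callable as `metric`, so the combined distance can be plugged in directly.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical column layout: two numeric features, one categorical feature.
NUM_IDX = [0, 1]
CAT_IDX = [2]

def mixed_distance(a, b, w_num=0.5, w_cat=0.5):
    """Weighted sum of a Euclidean part (numeric) and a Hamming part (categorical)."""
    # Euclidean distance on the scaled numeric columns, divided by
    # sqrt(number of numeric columns) so that it stays within [0, 1].
    num = np.sqrt(np.sum((a[NUM_IDX] - b[NUM_IDX]) ** 2)) / np.sqrt(len(NUM_IDX))
    # Hamming distance: fraction of categorical columns that differ.
    cat = np.mean(a[CAT_IDX] != b[CAT_IDX])
    return w_num * num + w_cat * cat

# Toy data: numeric columns already in [0, 1], category stored as an integer code.
X = np.array([
    [0.10, 0.20, 0],
    [0.15, 0.25, 0],
    [0.90, 0.80, 1],
    [0.85, 0.90, 1],
])
y = np.array([0, 0, 1, 1])

# A Python callable works as the metric; brute-force search keeps things simple.
knn = KNeighborsClassifier(n_neighbors=1, metric=mixed_distance, algorithm="brute")
knn.fit(X, y)
prediction = knn.predict(np.array([[0.12, 0.22, 0]]))
```

The weights are a tuning choice. One option is to treat them like any other hyperparameter and pick them by cross-validation (they can be passed through `metric_params`). Keep in mind that a Python callable metric is much slower than the built-in metrics.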

I encourage you to read this very interesting paper about categorical data, in which many distance measures are examined:

Similarity Measures for Categorical Data: A Comparative Evaluation

by Shyam Boriah, Varun Chandola and Vipin Kumar

http://www-users.cs.umn.edu/~sboriah/PDFs/BoriahBCK2008.pdf

One of them could be preferable to Hamming, depending on the case.
